Preface



Human conversational partners are able, at least to a certain extent, to detect 
the speaker’s or listener’s emotional state and may attempt to respond to it 
accordingly. When instead one of the interlocutors is a computer, a number of
questions arise, such as the following: To what extent are dialogue systems able
to simulate such behaviors? Can we learn the mechanisms of emotional behaviors
from observing and analyzing the behavior of human speakers? How can
emotions be automatically recognized from a user’s facial expressions, gestures
and speech? What possibilities does a dialogue system have to express emotions
itself? And, very importantly, would emotional system behavior be desirable at all?

Given the state of ongoing research into incorporating emotions in dialogue 
systems we found it timely to organize a Tutorial and Research Workshop on 
Affective Dialogue Systems (ADS 2004) at Kloster Irsee in Germany during June 
14-16, 2004. After two successful ISCA Tutorial and Research Workshops on 
Multimodal Dialogue Systems at the same location in 1999 and 2002, we felt 
that a workshop focusing on the role of affect in dialogue would be a valuable 
continuation of the workshop series. 

Due to its interdisciplinary nature, the workshop attracted submissions from 
researchers with very different backgrounds and from many different research 
areas, working on, for example, dialogue processing, speech recognition, speech 
synthesis, embodied conversational agents, computer graphics, animation, user 
modelling, tutoring systems, cognitive systems, and human-computer interaction.
Overall, ADS 2004 comprised 23 long papers and 12 short and demonstration
papers, not only from 10 Western and Eastern European countries, but also
from Canada, Japan and the US. The papers cover the following seven topic
areas: 

— emotion recognition 

— affective user modelling 

— affective conversational agents and dialogue simulation 

— emotional databases, annotation schemes and tools 

— synthesis of emotional speech and facial animation 

— affective tutoring systems 

— evaluation of affective dialogue systems 

The main characteristic of the papers in this volume is their endeavor to 
go beyond pure task-oriented approaches to dialogue processing to also address 
conversational aspects as well as psychological and social concerns. For example,
the volume includes a number of empirical studies that investigate in what sense 
the human user could benefit from the imitation of social behaviors. 

A theme that came up in a number of contributions in this volume is the 
important role of emotional factors in the development of embodied conversa- 
tional agents. Many authors are convinced that the integration of an affective 
component can significantly enhance an agent’s believability. 



For an agent to behave in a human-like way, appropriate synchronization of
speech, facial expressions and gestures is highly important and is a major concern
in research on affective dialogue systems. Research on the individual output
components is equally crucial to advances in such systems. In this volume this is
witnessed by a number of papers, not least on emotional speech synthesis.

The work on tutoring systems presented in this volume indicates that agents 
that attend to and attempt to influence the motivational state of students may 
aid the learning process. 

In addition to work on the expression of emotions, papers on the recognition 
and interpretation of the user’s emotional state are presented in this volume. 
Interestingly, the authors approach this topic from various perspectives. One 
group focuses on the signal-processing level, aiming to recognize typical emo- 
tional patterns from biosensors, facial expressions and speech. Another group 
extends traditional work on user modelling with an affective component. 

Finally, the volume includes papers on the evaluation of affective dialogue 
systems. Not least on the usability side, this is an area with many open ends. 

We would like to thank all authors for the effort they spent on their submissions,
and the program committee - more than 30 distinguished researchers from
industry and academia - who worked very hard under tight deadlines and selected
the best contributions for the final program. We are also grateful to the invited
speakers for enriching our workshop.

In addition, we would like to express our thanks to several people who assisted 
us in organizing the workshop. Torben Kruchov Madsen took care of the Web 
page for uploading papers. Angela Rittinger and Brigitte Waimer-Eichenauer 
provided worthwhile administrative support. A number of organizations supported
ADS 2004, including ACL Sigmedia, ACL/ISCA Sigdial, Gesellschaft für
Informatik (GI), and ISCA. In particular, we gratefully acknowledge GI for their
valuable assistance in handling the financial matters. Last, but not least, we are
grateful to Springer-Verlag for publishing the proceedings in their LNCS/LNAI
series.
From Emotion to Interaction: Lessons from Real
Human-Machine-Dialogues



Anton Batliner¹, Christian Hacker¹, Stefan Steidl¹, Elmar Nöth¹, and
Jürgen Haas²

¹ University of Erlangen-Nuremberg, Lehrstuhl für Mustererkennung / Chair for
Pattern Recognition, Martensstr. 3, 91058 Erlangen, F.R.G.
batliner@informatik.uni-erlangen.de
² Sympalog Voice Solutions GmbH, Karl Zuckerstr. 10, Erlangen, F.R.G.



Abstract. The monitoring of emotional user states can help to assess 
the progress of human-machine-communication. If we look at specific 
databases, however, we are faced with several problems: users behave 
differently, even within one and the same setting, and some phenomena 
are sparse; thus it is not possible to model and classify them reliably. We 
exemplify these difficulties on the basis of SympaFly, a database with 
dialogues between users and a fully automatic speech dialogue telephone 
system for flight reservation and booking, and discuss possible remedies. 

1 Introduction¹

It might be fair to describe one (maybe ‘the’) basic conceptualization of using 
information on emotions within automatic dialogue systems in the following way: 
if we detect something like anger, let’s initiate some recovery strategy or hand 
over to a human operator. If we detect something like joy, try to utilize this 
information, for instance, by offering some new, good bargain. This seems to be 
a realistic vision if we consider the good classification rates obtained for some 
basic emotions in the laboratory. As far as we can see, however, the few studies 
conducted during the last years dealing with non-acted emotions recorded in a 
realistic scenario report rather a negative correlation between full-blown, proto- 
typical emotions on the one hand, and frequency on the other hand; moreover, 
the recognition rates for real-life speech data go down considerably, cf. [2,5,1,9].
We believe that a way out of this dilemma is not only to collect more data but,
first of all, to take into account more phenomena: the monitoring of the user’s
behavior should not only consider some basic emotions but all kinds of emotional
user states, and in addition, we should look for any change in the user’s behavior
towards other ‘suspicious’ directions, e.g., use of meta-talk or of repetitions.
Thus, the focus of interest has to be shifted from a subject-centered towards an
interaction-centered point of view, cf. section 6.

¹ This work was funded by the EU in the project PF-STAR (http://pfstar.itc.it/)
under grant IST-2001-37599 and by the German Federal Ministry of Education,
Science, Research and Technology (BMBF) in the SmartKom project under Grant
01 IL 905 K7. The responsibility for the contents of this study lies with the authors.

In this paper, which reports work in progress, we first present SympaFly,
a fully automatic speech dialogue telephone system for flight reservation and
booking. In the first stage of this system, performance was rather poor (approx.
30% dialogue success rate); in the last stage, performance was very good (above
90% dialogue success rate). All dialogues were orthographically transliterated
and annotated for (emotional) user states, prosodic peculiarities, dialogue
(step) success rate, and conversational peculiarities. For classification of user 
states, a large prosodic feature vector was used. We will show that users employ 
different strategies, and that it is really mandatory to deal with the sparse data 
problem as far as emotional user states are concerned. 

2 The SympaFly Database 

SympaFly is a fully automatic speech dialogue telephone system for flight reser- 
vation and booking. The database comprises three different stages; the methodo- 
logy consisted of a rapid prototyping phase followed by optimization iterations. 
Subjects were asked to call the automatic dialogue system and book one or more 
flights. The caller should, for instance, book a flight from Zurich to Tiflis and 
back so that the meeting there can take place at a specific time. Additional 
information had to be given, e.g., frequent flyer id, credit card number, and so 
on. The three evaluation stages can be characterized as follows; a more detailed 
account of the system design can be found in [4]: 

— The first part of the data set, S1 (110 dialogues, 2291 user turns, 11581
words; 5.1 words per turn, 105 words and 20.8 turns per dialogue), comprises
those dialogues which were collected in the first test of the system, conducted
by an independent usability lab. The system had been built by only using the
input of involved system developers and designers, without any external
evaluation whatsoever. The performance of the system was rather poor.

— The dialogues in the second phase S2 (annotated and processed: 98 dialo- 
gues, 2674 user turns, 9964 words; 3.7 words per turn, 102 words and 27.3 
turns per dialogue) cover several system phases, wherein the system perfor- 
mance was increased little by little, sometimes from one day to the other. 
Due to this, the individual dialogues can strongly differ depending on the 
system performance at a particular time. Callers were volunteers without 
any connection with the usability lab. 

— Finally, the third part S3 (62 dialogues, 1900 user turns, 7655 words; 4.0 
words per turn, 124 words and 30.6 turns per dialogue) contains dialogues 
collected through the final system, by using the same experimental setting 
as for S1: same telephone channel, callers are supervised by the usability lab.
The performance of the system was now excellent. 
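
The per-stage ratios quoted in the list above follow directly from the raw counts. As a small illustration, the sketch below recomputes them; the class name `Stage` and its method names are our own illustrative choices and not part of the SympaFly tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Raw counts for one evaluation stage, copied from the list above."""
    name: str
    dialogues: int
    turns: int
    words: int

    def words_per_turn(self) -> float:
        return self.words / self.turns

    def words_per_dialogue(self) -> float:
        return self.words / self.dialogues

    def turns_per_dialogue(self) -> float:
        return self.turns / self.dialogues


# Counts as reported for the three stages of the database.
STAGES = [
    Stage("S1", dialogues=110, turns=2291, words=11581),
    Stage("S2", dialogues=98, turns=2674, words=9964),
    Stage("S3", dialogues=62, turns=1900, words=7655),
]

if __name__ == "__main__":
    for s in STAGES:
        print(
            f"{s.name}: {s.words_per_turn():.1f} words/turn, "
            f"{s.words_per_dialogue():.1f} words/dialogue, "
            f"{s.turns_per_dialogue():.1f} turns/dialogue"
        )
```

The shrinking number of words per turn from the first to the later stages is visible directly in these ratios.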
3 Annotations and Feature Extraction

For the annotation of holistic (emotional) user states, no pre-defined set of 
labels was given; two labellers decided themselves which and how many different 
user states to annotate; interlabeller correspondence is discussed in [4]. After a 
first, independent run the labellers decided on a consensus labelling in a second 
run. The following turn-based labels (given in italics) were used and mapped onto 
these five cover classes (given recte and in boldface): positive: Joyful; neutral:
Neutral; pronounced: Emphatic; weak negative: Surprised, Ironic; strong
negative: Helpless, Panic, Touchy (i.e., irritated), Angry. Emphatic is taken
as a sort of ‘basically suspicious’ state - in our scenario most likely not positive,
but indicating problems; this assumption will be discussed further below.
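
The mapping of the nine turn-based labels onto the five cover classes can be written down as a simple lookup table. The following is only a sketch; the dictionary and function names are illustrative and do not stem from the annotation tools actually used.

```python
# Nine turn-based user-state labels mapped onto the five cover classes
# named in the text. Identifier names are illustrative assumptions.
COVER_CLASS = {
    "joyful": "positive",
    "neutral": "neutral",
    "emphatic": "pronounced",
    "surprised": "weak negative",
    "ironic": "weak negative",
    "helpless": "strong negative",
    "panic": "strong negative",
    "touchy": "strong negative",
    "angry": "strong negative",
}


def to_cover_class(label: str) -> str:
    """Return the cover class for a user-state label (case-insensitive)."""
    return COVER_CLASS[label.strip().lower()]
```

A label outside the nine listed ones raises a `KeyError`, which seems preferable to silently treating it as neutral.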

It can be assumed that users encountering difficulties in the communication
with a system change their way of speaking, for instance, by emphasising salient
information. In Table 1², the labels used for the annotation of such prosodic
peculiarities are given, arranged according to their presumed strength; labels 
covering more than one strength level can be either the one or the other level. 
(For a two-class problem, the three labels given in italics could be attributed to 
the (cover) class neutral.) Laughter and syllable lengthening cannot be attributed 
to one specific level of prosodic strength. More than one label can be attributed 
to the same word; in such a case, for the mapping onto strength levels, the 
strongest one ‘wins’. This is again a consensus labelling of two annotators. The 
label set has been used in the Verbmobil- and in the SmartKom-project [2,11]. 
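
The rule that the strongest label ‘wins’ when several labels are attached to one word can be sketched as follows. Only the three pause labels, whose strength levels are unambiguous in Table 1, are included here; the levels of the remaining labels are deliberately left out, and all identifier names are illustrative assumptions.

```python
from typing import Iterable, Optional

# Strength levels in increasing order.
STRENGTH_ORDER = ("weak", "medium", "strong")

# Only the pause labels are listed; extending this table to the other
# labels would require the exact column spans of Table 1.
LABEL_STRENGTH = {
    "pause_phrase": "weak",
    "pause_word": "medium",
    "pause_syllable": "strong",
}


def word_strength(labels: Iterable[str]) -> Optional[str]:
    """Map all labels of one word onto a single level: the strongest wins.

    Labels without a known level (e.g. laughter) are ignored; None is
    returned if no label of the word has a known level.
    """
    levels = [LABEL_STRENGTH[lab] for lab in labels if lab in LABEL_STRENGTH]
    if not levels:
        return None
    return max(levels, key=STRENGTH_ORDER.index)
```

For a word annotated with both `pause_phrase` and `pause_syllable`, the function returns the level of `pause_syllable`.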



Table 1. Prosodic peculiarities, annotated word-based, and their strength

weak                    medium          strong
pause_phrase            pause_word      pause_syllable
emphasis                                strong emphasis
clear_articulation                      hyper_articulation
lengthening_syllable    (no specific strength level)
laughter                (no specific strength level)


Another labeller annotated the success of a dialogue using four levels:
null (no user confirmation, no booking), full (confirmation and booking), and
two levels in between: some (maybe confirmation but no booking), and medium
(confirmation, but no ‘ideal’ booking). In addition to this global measure, we
annotate for each turn ten slots that can - but need not - be filled in each user
utterance: departure, destination, date, time, class, persons, membership (in the
frequent flyer program), number of membership, credit-card number, credit-card
validity. These slot fillers can be compared with the preceding system utterance,
and then we can decide whether a dialogue step has been successful or not.
The computation of such ‘linguistic’ features and the recognition rates for
dialogue step success (83%) and dialogue success (85%) based on these features
are reported elsewhere, in [10].

² ‘pause_phrase’: extra long pause between syntactic units; ‘pause_word’: pause
between words inside syntactic unit; ‘pause_syllable’: pause inside word; the
other labels are self-explanatory.

The following conversational peculiarities (i.e., special dialogue acts) 
were annotated by the same labeller: different types of repetition, different types 
of out-of-dialogue sequences (speaking aside, etc.), and no answer (if the user 
does not produce any answer at all). 

For spontaneous and emotional speech it is still an open question which 
prosodic features are relevant for the different classification problems, and 
how the different features are interrelated. We try therefore to be as exhaustive 
as possible, and we use a highly redundant feature set leaving it to the statistical 
classifier to find out the relevant features and the optimal weighting of them. 
For the computation of the prosodic features, a fixed reference point has to 
be chosen. We decided in favor of the end of a word because the word is a 
well-defined unit in word recognition, and because this point can more easily 
be defined than, for example, the middle of the syllable nucleus in word accent 
position. 95 relevant prosodic features modelling duration, energy and F0 are
extracted from different context windows. The context was chosen from two 
words before, and two words after, around a word; by that, we use so to speak a 
‘prosodic five-gram’. In addition, we use 30 part-of-speech labels modelling the 
same ‘five-gram’ context. Details are given in [2]. With other types of features 
(spectral, linguistic, etc.), classification experiments are on-going and will be 
reported elsewhere. 
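
The ‘five-gram’ context can be illustrated by a small helper that collects, for the word at a given position, the items of the two preceding and the two following words. This is only a sketch of the windowing idea: the padding of turn-initial and turn-final positions is our assumption, and the 95 prosodic features of [2] are not simply five copies of one per-word feature set.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def five_gram_context(
    items: Sequence[T], index: int, pad: Optional[T] = None
) -> List[Optional[T]]:
    """Return the items at positions index-2 .. index+2.

    Positions outside the sequence are filled with `pad` (an assumption;
    the treatment of edges is not specified in the text).
    """
    if not 0 <= index < len(items):
        raise IndexError("index outside of the word sequence")
    return [
        items[j] if 0 <= j < len(items) else pad
        for j in range(index - 2, index + 3)
    ]
```

The same helper can be applied to a sequence of part-of-speech labels to obtain the corresponding POS context of a word.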



4 Different User Strategies: More Things between 
Heaven and Earth 

Figure 1 illustrates the improvement in dialogue success from S1 via S2 to S3: an
almost equal distribution of the four levels for S1 on the one hand, and approx.
90% full dialogue success for S3, S2 being in between. In Figures 2 to 4, the
frequencies in percent of different label types are given for the four levels of 
dialogue success. In Figures 2 and 3, all the bars for each success level sum up, 
together with the neutral cases which are not shown, to 100%. In Figure 4, each 
user state sums up across all four success levels to 100%. 

In Figure 2, a marked drop in frequency for full success can be seen for out- 
of-dialogue sequences and no answers. Repetitions, however, are almost equally 
distributed across all four success levels. This is, at first glance, a bit puzzling 
if we assume that repetitions are generally to be taken as indications of
misunderstandings: if this holds true, fully successful dialogues should on average
produce fewer repetitions than dialogues with medium, small or no success.

A similar distribution is displayed in Figure 3 where four cover classes of
prosodic peculiarities (medium of Table 1 mapped onto strong) are displayed for
the four different success levels: if we take such peculiarities as a (possible and
likely) indication of misunderstandings then the two less successful levels should
produce more prosodic peculiarities than the two more successful levels; in fact,
for weak, it is the other way round! The other three prosodic cover classes have
a very low frequency throughout.

[Figures 1 to 4; x-axis of each panel: null, some, med., full]

Fig. 1. Distribution of dialogue success
for the three system stages

Fig. 2. Dialogue success and frequencies
of conversational peculiarities in percent

Fig. 3. Dialogue success and frequencies
of prosodic peculiarities in percent

Fig. 4. Dialogue success and frequencies
of user states (5 cover classes) in percent

Finally, Figure 4 displays the frequencies of the five cover classes for the user 
states, including neutral. Although neutral turns are more frequent in the fully 
successful dialogues, the opposite is not necessarily true for the marked cases: 
esp. strong negative cases are rather equally distributed, and there are more em- 
phatic cases for medium and full success. (Note that most of the emphatic words 
are ‘marked prosodically’ as well (72.3%), but only 33.1% of the ‘prosodically 
marked’ words are labelled as emphatic.) 

Of course, the caveat has to be made that this is only a snapshot - we do 
not know whether this picture holds across other databases, and we cannot fully 
disentangle all possible factors. Moreover, some of the phenomena that have 
been discussed in this section have a rather low overall frequency: in Figure 2, 
repetitions 3.1%, and out-of-dialogue sequences 3.8%. The same holds at least 
for four user states (frequencies of all user states are given below in Table 3): 
the 58 joyful labels are found in 13 turns out of 11 dialogues. The corresponding 
figures for surprised are 31 tokens, 5 turns, 3 dialogues; for panic: 43 tokens, 6 
turns, 6 dialogues; and for angry: 40 tokens, 3 turns, 3 dialogues. Thus the high
frequency of joyful for null success in Figure 4 could be due to some spurious
factors, to some malignance, or to real joy if some dialogue step goes well in an 
otherwise unsuccessful dialogue. 




Fig. 5. Frequencies in percent of consecutive dialogue step failure in adjacent user 
turns for each of the four dialogue success levels; on the x-axis, the number of successive 
unsuccessful dialogue steps is given 




Fig. 6. Frequencies in percent of consecutive dialogue step success in adjacent user 
turns for each of the four dialogue success levels; on the x-axis, the number of successive 
successful dialogue steps is given 



We believe, however, that a possible - and likely - explanation for all these 
prima facie puzzling distributions (more ‘suspicious’ peculiarities in more suc- 
cessful dialogues) might partly be that we simply have to deal with two different 
types of user personalities and strategies: less co-operative users might stick to 
their usual way of speaking - both w.r.t. casual speaking style and dialogue acts. 
More co-operative speakers might try hard to make themselves understood by
changing their speaking style from casual to lento/clear, and by varying their
dialogue strategies by using more repetitions and/or reformulations than usual.
Phenomena such as repetitions and prosodic peculiarities, and maybe at least some
of the marked user states, might thus still be an indication of some local
difficulties, but overall frequency cannot simply be taken as an indication of
global dialogue failure.

This hypothesis can nicely be illustrated by Figures 5 and 6: the underlying 
assumption is that the more consecutive, adjacent dialogue step failures we can 
observe in a dialogue, the more likely it is that the whole dialogue turns out 
to be a failure, i.e., is not annotated as fully successful - and vice versa: if a 
dialogue contains many consecutive, adjacent successful dialogue steps, then it 
is more likely that the whole dialogue will be successful. In Figures 5 and 6, we 
can observe a sort of turning point at the threshold between 5 and 6 consecutive 
dialogue step successes/failures: for 6 and more, the bars for null are normally 
higher than the bars for full in Figure 5, and vice versa, the bars for full are 
normally higher than the bars for null in Figure 6. The intermediate levels some 
and medium are in between, some tending towards null and medium towards 
full. Obviously, after up to five consecutive dialogue step failures, the dialogue 
can still recover gracefully - most likely with the help of a co-operative user. The 
overall frequency of six and more consecutive dialogue step failures/successes is, 
of course, low in comparison with the rest which is rather equally distributed 
amongst the four dialogue success levels. 
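
Counts of consecutive dialogue step failures or successes, as underlying Figures 5 and 6, can be obtained from the per-turn step annotation with a run-length computation. The sketch below assumes one Boolean per user turn (True for a successful step) and counts each maximal run once; whether the figures were computed in exactly this way is not stated in the text, and the function name is illustrative.

```python
from collections import Counter
from itertools import groupby
from typing import Iterable


def run_lengths(steps: Iterable[bool], value: bool) -> Counter:
    """Histogram of the lengths of maximal runs of `value` in `steps`.

    steps: one entry per user turn, True = successful dialogue step.
    value: False to count failure runs, True to count success runs.
    """
    counts: Counter = Counter()
    for key, group in groupby(steps):
        if key == value:
            counts[sum(1 for _ in group)] += 1
    return counts


def longest_run(steps: Iterable[bool], value: bool) -> int:
    """Length of the longest run of `value`, 0 if there is none."""
    return max(run_lengths(steps, value), default=0)
```

With such counts per dialogue, the turning point between 5 and 6 consecutive failures discussed above can be checked by comparing `longest_run(steps, False)` across the four dialogue success levels.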



5 Sparse Data: You Can’t Always Get What You Want 
(to Classify) 



As base-line classifier we use linear discriminant analysis (LDA) with our word- 
based features as predictor variables (95 prosodic, 30 POS); experiments with 
Neural Networks and Decision Trees and with turn-based labels are on-going. We 
use the spoken word chain, assuming 100% correct word recognition, and classify 
on the word level. In Tables 2 and 3, we report two figures for our classification 
tasks: the overall recognition rate RR (# of correctly classified items divided 
by # of all items), and the class-wise computed recognition rate CL (average of
recognition rates across all classes).³
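
The two figures can be computed from reference and predicted labels as in the following sketch; the function name is illustrative, and the code only restates the definitions of RR and CL given above.

```python
from collections import defaultdict
from typing import Hashable, Sequence, Tuple


def recognition_rates(
    reference: Sequence[Hashable], predicted: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (RR, CL) as fractions between 0 and 1.

    RR: number of correctly classified items divided by number of all items.
    CL: unweighted average of the per-class recognition rates, computed
        over the classes that occur in the reference.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for ref, hyp in zip(reference, predicted):
        total[ref] += 1
        if ref == hyp:
            correct[ref] += 1
    rr = sum(correct.values()) / len(reference)
    cl = sum(correct[c] / total[c] for c in total) / len(total)
    return rr, cl
```

For skewed class distributions the two figures diverge: with eight neutral and two problem items, a classifier that recognizes all neutral items but only one of the two problem items reaches RR = 0.9 but CL = 0.75. This is why both are reported.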

In Table 2, RR and CL are shown for different combinations of features given
in the first column features (#), and for two different constellations of learn vs.
test sample: first we have not divided the database into training and test sample,
but have resorted to leave-one-out (columns loo). In addition, we divided the
database into a training sample consisting of all items from S1 and S3, i.e., from
the usability lab, and a test sample, consisting of all items from S2, i.e., from
volunteering people without any connection with the usability lab (columns l≠t).
This is a fair but, at the same time, rather difficult test because we are definitely
faced with different telephone channels and a different proficiency of the callers.
In Table 2, we address three 2-class classification tasks: first, the two cover classes
no-problem (joyful, neutral) vs. problem (the other 7 labels); results are given
in the columns with the title user states. For the second task, we put all those
cases into the class ‘problem’ which are not labelled as joyful or neutral and,
at the same time, are labelled as prosodically marked vs. the rest: items which
are marked prosodically but at the same time labelled as joyful or neutral, items
which are not marked prosodically but at the same time labelled as not joyful or
neutral, and items which are not marked prosodically and labelled as joyful or
neutral. These figures are given in the columns with the title clear vs. rest. For
the third task, we only use clear cases and cases that are either neutral or joyful
and at the same time, not marked prosodically (columns clear vs. unmarked);
for this task, only 13916 out of 21125 cases are processed. With POS features,
recognition rates are up to two percentage points better than with prosodic
features alone. RR and CL for l≠t are always only some few percent worse than
for loo; thus it is likely that these results will generalize.⁴ The ‘good’ results
above 80% for the third task clear vs. unmarked indicate that the prosodic
annotation is reliable. Results for the more realistic first two tasks that are
(slightly) below or above 70% are, in our experience, realistic but of course not
very good for a two-class problem. This might be due to the fact that there are
not that many marked cases in our database (sparse data problem, cf. the
frequencies in Table 3), and that these cases are based on different user states
with at least partly different prosodic marking. Moreover, linguistic features that
are of course not modelled by acoustics play a role as well.

³ For these experiments, we used the words from all turns except the ones from those
five neutral turns of each speaker that were used for the normalization sample for
an additional set of spectral features.
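
The definition of the three 2-class tasks can be made explicit in a few lines. The sketch below derives the class of a word for each task from its user-state label and a flag for prosodic marking; the dictionary keys and the function name are illustrative, and `None` marks items that are not processed in the third task.

```python
from typing import Dict, Optional

NO_PROBLEM_STATES = {"joyful", "neutral"}


def task_labels(user_state: str, prosodically_marked: bool) -> Dict[str, Optional[str]]:
    """Class of one item in each of the three 2-class tasks of Table 2."""
    problem_state = user_state.lower() not in NO_PROBLEM_STATES
    # 'clear' cases: problem user state and prosodically marked at the same time.
    clear = problem_state and prosodically_marked
    # 'unmarked' cases: joyful or neutral and not marked prosodically.
    unmarked = (not problem_state) and (not prosodically_marked)
    if clear:
        third = "clear"
    elif unmarked:
        third = "unmarked"
    else:
        third = None  # item is left out of the third task
    return {
        "user_states": "problem" if problem_state else "no-problem",
        "clear_vs_rest": "clear" if clear else "rest",
        "clear_vs_unmarked": third,
    }
```

Items for which the third entry is `None` are exactly the mixed cases, which explains why only part of the 21125 items enters the third task.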



Table 2. Percent correct classification for three different tasks, cf. explanation in text,
2-class problems, LDA, leave-one-out and learn≠test

                user states               clear vs. rest            clear vs. unmarked
                loo         l≠t           loo         l≠t           loo         l≠t
features (#)    RR    CL    RR    CL      RR    CL    RR    CL      RR    CL    RR    CL
prosodic (95)   71.1  70.3  69.7  65.5    74.7  72.9  72.6  69.4    82.3  80.3  80.8  79.4
p.+POS (125)    72.6  72.3  69.7  67.3    75.9  74.8  72.3  71.9    83.3  81.5  81.5  80.7




Table 3 gives an impression of the recognition performance if all nine user
states have to be classified. This can only be done for the loo task, due to the
fact that for several classes, there are only a few items. Classes with more items
(neutral, emphatic) yield better recognition rates than classes with only a few
items. The low recognition rates for ironic could be expected because this user
state should definitely not be marked prosodically. Again, POS features contribute
to classification performance, cf. RR and CL; these figures are well above
chance level, but not very good. We refrain from a more detailed interpretation
because we do not know yet whether these results will generalize. Anyway, we
are far from the classification performance obtained in other studies for acted,
full-blown emotions.

⁴ Note that in loo, the speakers are ‘seen’ whereas in l≠t, the speakers from the training
sample are disjoint from the speakers from the test sample.

Table 3. Percent correct classification for all 9 user states, chance level 11.1%

features  Joyful  Neutral  Emph.  Surpr.  Ironic  Helpl.  Panic  Touchy  Angry  RR     CL
#         58      15390    3708   31      395     654     43     806     40     21125  21125
pros.     19.0    38.5     37.7   19.4    9.6     30.1    7.0    24.3    27.5   36.9   23.7
p.+POS    15.5    44.6     41.5   12.9    10.6    32.3    20.9   21.7    25.0   42.0   29.7

Table 4. Mapping of the 9 user states onto 4 cover classes, word-based leave-one-out
and learn≠test, turn-based only leave-one-out

COVER CLASS  user states              # loo   # test in l≠t  loo turn-based
NEUTRAL      joyful, neutral, ironic  15843   5068           4492
EMPHATIC     emphatic, surprised      3739    1691           583
HELPLESS     helpless                 654     59             49
MARKED       panic, touchy, angry     889     243            167


Obviously, we have to map at least some of the detailed user states onto cover 
classes in order to overcome the sparse data problem. One meaningful possibility 
- of course, other mappings can be imagined as well - to do this is given in Table 
4: the very few joyful cases are mapped onto NEUTRAL, i.e., no action has to be 
taken. Ironic is mapped onto NEUTRAL as well because it does not make any 
sense to try and recognize it with prosodic features. Both surprised and emphatic 
are mapped onto EMPHATIC because they denote those cases where we do not 
know whether they stand for some ‘negative’ emotion or for some co-operative 
effort. In the case of HELPLESS, some help could be offered by the system, and 
in the case of MARKED (panic, touchy, angry), the system should definitely try
to find a way out. The third and the fourth column display the number of cases
for each cover class and for the two classification tasks. Classification rates for
these four classes are given in Table 5 for word-based loo and l≠t; with such
a classification performance, we could maybe try to reduce the search space, in
combination with other knowledge sources, cf. [5].

Table 5. Percent correct classification for the four COVER CLASSES, LDA, 95 prosodic
and 30 POS features, chance level 25%, word-based leave-one-out and learn≠test,
turn-based only leave-one-out

domain            NEUTRAL  EMPHATIC  HELPLESS  MARKED  RR    CL
word-based  loo   61.7     49.1      54.6      33.9    58.1  49.8
word-based  l≠t   61.1     57.6      42.4      28.0    59.0  47.3
turn-based  loo   70.4     58.8      51.0      46.1    68.2  56.6
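
The mapping of Table 4 and the resulting word counts per cover class can be reproduced from the word counts per user state given in Table 3. The following sketch uses illustrative identifier names; the numbers are those of the two tables.

```python
from collections import Counter
from typing import Dict

# Word counts per user state (row '#' of Table 3).
WORD_COUNTS = {
    "joyful": 58, "neutral": 15390, "emphatic": 3708, "surprised": 31,
    "ironic": 395, "helpless": 654, "panic": 43, "touchy": 806, "angry": 40,
}

# Mapping of the nine user states onto the four cover classes of Table 4.
COVER_CLASS_4 = {
    "joyful": "NEUTRAL", "neutral": "NEUTRAL", "ironic": "NEUTRAL",
    "emphatic": "EMPHATIC", "surprised": "EMPHATIC",
    "helpless": "HELPLESS",
    "panic": "MARKED", "touchy": "MARKED", "angry": "MARKED",
}


def cover_class_counts(word_counts: Dict[str, int]) -> Counter:
    """Sum the word counts of all user states mapped onto each cover class."""
    totals: Counter = Counter()
    for state, n in word_counts.items():
        totals[COVER_CLASS_4[state]] += n
    return totals
```

Even after the mapping, the class sizes remain very unequal, so the sparse data problem is reduced but not removed.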

Another possibility to improve recognition performance is to go on to turn- 
based classification. Our user state labels have been annotated turn-based - 
to keep the effort low, and because in a human-machine-dialogue, the system 
normally reacts to turns and not to parts of turns. On the one hand, not every 
word in a marked turn might have been produced in a marked way: this might 
hold for only one (sub-) phrase and/or only for salient (content) words but 
not necessarily for every function word. Our labelling is thus based on a sort 
of majority voting: the annotator decided that in a turn, there are so many
salient words produced in a marked way that she labelled the whole turn as 
marked. On the other hand, it really might not be necessary to know which 
words are produced in a marked way indicating a specific user state. It might be 
sufficient that it is some/most of the salient words. Different ways of computing 
a majority voting can be imagined. Here we first computed a mean probability
value for each turn from the probability values of each word obtained from the
word-based classification step, cf. for the two-class problem Table 2, columns user
states, loo, line p.+POS, and for the four-class problem, Table 5, line loo for the
word-based computation. (Due to sparse data, l≠t does not make any sense here,
cf. the last column in Table 4.) In addition to this score-feature, we computed
global prosodic features (mean or sum of the word-based prosodic features for 
all words in a turn) and used both mean probability values and global prosodic 
features as predictors for classification. By that, we so to speak combine local 
(word-based) with global (turn-based) information. For the two-class problem, 
we could improve RR from 72.6% word-based to 78.9% turn-based, and CL from 
72.3% word-based to 76.3% turn-based. For the four-class problem, the last line 
in Table 5 shows the turn-based classification results. Again, the improvement 
is evident: RR from 58.1% word-based to 68.2% turn-based, and CL from 49.8% 
word-based to 56.6% turn-based. 
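
The combination of local (word-based) and global (turn-based) information can be sketched as follows: for each turn, the word-level probabilities are averaged into one score feature, and the word-based prosodic features are averaged into global features. The function name and the data layout are illustrative assumptions; the text allows mean or sum for the global features, and only the mean is shown here. The resulting vector would then be passed to the turn-level classifier.

```python
from statistics import fmean
from typing import List, Sequence


def turn_features(
    word_probs: Sequence[float],
    word_prosody: Sequence[Sequence[float]],
) -> List[float]:
    """Build one turn-level predictor vector from word-level information.

    word_probs:   per-word probability of the class of interest, taken from
                  the word-based classification step.
    word_prosody: per-word prosodic feature vectors, all of equal length.
    Returns [mean probability, mean of feature 1, mean of feature 2, ...].
    """
    if not word_probs or len(word_probs) != len(word_prosody):
        raise ValueError("need one probability and one feature vector per word")
    n_features = len(word_prosody[0])
    if any(len(vec) != n_features for vec in word_prosody):
        raise ValueError("all prosodic feature vectors must have equal length")
    score = fmean(word_probs)
    global_means = [
        fmean(vec[k] for vec in word_prosody) for k in range(n_features)
    ]
    return [score] + global_means
```

Averaging means that a turn with a few clearly marked salient words and many unmarked function words still obtains a raised score, which is in line with the ‘majority voting’ view of the turn-based labels.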

6 From Emotion to Interaction 

In [4], we reformulated Labov’s observer’s paradox tailoring it for the study of 
emotion. Now we want to broaden the view, from emotion to interaction: we are 
faced with the problem that on the one hand, clear-cut indications of emotions 
are sparse, but on the other hand, we can observe different user behavior, i.e.,
different roles the user can take over, cf. the social constructivist perspective of
emotion [7,8].⁵ A promising way to overcome this problem is thus to shift the
focus, away from ‘private, egocentric, solipsistic, subject-centered, monologic’
emotions towards ‘dialogic, partner-oriented’ attitudes that have an impact on
the communication by, e.g., defining or altering the role-specific behavior of the
user and, by that, maybe of the system as well - if it is capable of doing that.
Such a concept fits nicely into the development of automatic speech processing
systems sketched in Table 6 where we attribute speech acts and their definition
to the pertaining realm of phenomena and application systems: automatic dictation
and dialogue systems, and, finally, automatic interaction systems - systems that
take into account the user’s attitudes towards the system and vice versa. This
includes all the ‘fringe’ phenomena without clear semantics but with paralinguistic,
interpersonal impact (backchannelling, fillers, etc.), communicative strategies
(repetitions, reformulations, etc.), and indication of attitudes/user states - but
not those pure emotions that normally are not signalled overtly. (As for similar
considerations from the point of view of ‘expressive’ synthesis, cf. [6].)

⁵ This is even more obvious if it comes to human-robot-communication, cf. [3], where
a database with children interacting with Sony’s AIBO robot is analyzed: one type
of user conceives of the AIBO only as a remote control toy, the other type
establishes a relationship with a sort of pet dog.

What about affective, emotional systems? In the present scientific discourse, 
the term ‘emotion’ is used in a multifarious way - because people are aware 
that a concept limited only to prototypical emotions might be of limited use for 
automatic systems. Still the bulk of knowledge on emotions has been collected 
using full-blown, acted emotions in the laboratory - and this very knowledge 
forms the basis for the conceptualisations of using emotions within automatic 
systems, cf. section 1. We believe that such emotional acts modelled and used in 
affective systems will only be useful for some special purposes, as, e.g., computer 
games. Eventually, we will end up with interpersonal acts within interactive 
systems. 



Table 6. From linguistics to paralinguistics, from emotion to interaction 



speech acts   definition                                realm                systems

locution      the act of saying                         words/sentences      dictation
illocution    reference to speaker’s purpose            dialogue acts        dialogue
perlocution   effects on behavior, feelings, beliefs,   emotional acts       affective
              actions, etc. of a listener               interpersonal acts   interaction



7 Conclusion and Future Work 

As a first step, we advocate a sort of example-based surveying by using diffe- 
rent databases, different acoustic feature sets, and different algorithms without 
intending to get at a unified approach - this will only be possible in a later 
stage. It depends crucially on the specific kind of database which phenomena 
one can expect and how many (sparse data problem), which different strategies 
users employ, and whether voice characteristics can simply be averaged across 
speakers or have to be modelled separately for each speaker; i.e., there is not one 
set of phenomena which maybe sometimes have to be clustered differently, as in 
the case of accents or prosodic/syntactic boundaries, but many different sets. 

As for SympaFly, we want to concentrate on two different strategies to improve
classification performance: first, we want to test other alternatives of turn-based
classification using global features and/or majority voting for word-based
features. Second, we want to use additional linguistic information, e.g., language
models, for classification.
Abstract. Emotions in continuous speech were analyzed using inverse
filtering and a recently developed glottal flow parameter, the normalized 
amplitude quotient (NAQ). Simulated emotion portrayals were produced 
by 9 professional stage actors. Segments of the vowel /a:/ were sepa- 
rated from continuous speech. The segments were inverse filtered and 
parametrized using NAQ. Statistical analyses showed significant differ- 
ences between most studied emotions. Results also showed clear gender 
differences. Inverse filtering together with NAQ was shown to be a suit- 
able method for analysis of emotional content in continuous speech. 



1 Introduction 

In speech, emotion is communicated by a combination of features at all three 
principal levels of speech abstraction: suprasegmental, segmental, and intraseg- 
mental [1]. All of these three levels can be considered as having two components, 
namely verbal (words) and vocal (intonation and voice quality). The vocal in- 
formation at any of these levels can render the verbal information redundant, 
qualify it further, or directly contradict it. At suprasegmental and segmental levels,
this information includes fundamental frequency (f0), sound pressure level
(SPL), and duration patterns, as well as variations in formant configurations.
However, Pollack and colleagues [2] noted that emotion can be recognized in 
segments of speech as short as 60 ms. On this intrasegmental level, vocal expres- 
sion of emotion is performed with various voice quality adjustments generated 
by manipulation of the voice source (i.e. the glottal volume velocity waveform 
generated by the vibrating vocal folds). 

Numerous studies on fundamental frequency and intensity patterns have been 
conducted in the past decades [1,3]. Scherer [4] suggests that f0, loudness and
temporal characteristics correlate with the activation dimension of emotion, while
the discrete emotions themselves cannot be modeled well using those features.
Scherer [5] further concludes that the key to the vocal differentiation of discrete 
emotions is the voice quality. He also cites the conceptual and methodological 
difficulties as the reason for the neglect of voice quality in empirical studies of 
vocal expression of emotion. The voice quality is defined as the characteristic 
auditory coloring of an individual speaker’s voice, to which both laryngeal and 
supralaryngeal features contribute [6]. The laryngeal features are all embedded
in the airflow through the glottis. A method called inverse filtering is used to
derive the glottal flow waveform from the extraoral sound pressure or oral volume
velocity waveform [7,8]. In inverse filtering, the signal is passed through filters,
which are set to match the inverse effect of the vocal tract (and lip radiation,
if applicable). This removes the effect of the formants from the signal, resulting
in an approximation of the glottal waveform. Several methods for automatically
setting the inverse filters have been proposed [9,10,11].

Inverse filtering by itself only results in a glottal flow waveform, which needs
to be parametrized to get quantitative measures of the voice source function. Dif- 
ferent parameterization types include time-based parameters, amplitude-based 
parameters, frequency domain parameters, and function fitting [12,13,14]. Some 
studies regarding both glottal flow and voice quality, as well as glottal flow and 
expression of emotion have been conducted [15,16,13,17,18,19]. However, in these 
studies the number of subjects has generally been rather small, and the param- 
eters used may have been suboptimal in describing the voice quality changes. 

In this study, voice quality changes in vocal expression of emotion in continuous
speech are quantitatively measured. Vowel segments of short duration are
acquired from simulated emotion portrayals and inverse filtered with a sophisticated
automatic inverse filtering method. The resulting glottal volume waveform
is parametrized using a novel and robust parameter measuring the characteristics
of the closing phase of the glottal cycle. The parameter data is statistically
analyzed to find out differences between the emotions.

2 Materials and Methods 

2.1 Speech Samples 

The chosen emotion elicitation method was simulated portrayals of vocal ex- 
pression of emotion. Natural emotion expression and induced emotions were laid 
aside due to difficulties obtaining good, representative emotional samples [3]. 

Five different emotions were used: neutral, sadness, joy, anger, tenderness. 
The emotions were chosen so that they can be clearly separated in the activation- 
evaluation space. Sadness and tenderness have low activation, while joy and 
anger have high activation. The valence of sadness and anger is negative, and
that of joy and tenderness is positive. The label “neutral”, while it may not be an
emotion in the strict sense, stands for average valence and activation.

Nine professional stage actors employed at the City Theater of Oulu, Finland 
served as speakers in the study. The subjects (five males and four females) were 
26-45 years of age (median: 42). They were all native speakers of Finnish. The 
subjects were paid for the recording. 

The speech material in the study was a text passage of 83 words of Finnish 
prose. The contents of the passage could easily be expressed in different emotions. 
The recitations took approximately one minute each. No explicit scenarios were 
given for different emotions, but instead the actors were simply asked to recite 
the passage in the given emotion. The emotions were referred to by their Finnish names
only (“neutraali”, “suru”, “ilo”, “viha”, and “hellyys”). The actors were free to
use any method they deemed suitable for the emotion induction.

The five different emotions were repeated by the nine actors ten times each, 
giving a total of 50 recitations per actor (referred to as the recitation number). 
Thus, the total number of spoken passages was 450. The order of the recitations 
was chosen so that they were randomized, but no emotion was recited twice in 
a row. 
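One way to obtain such an order (randomized, with no emotion recited twice in a row) is sketched below. How the original order was actually generated is not stated; the function `recitation_order` is a hypothetical helper that builds the sequence step by step and restarts when it reaches a dead end, and it does not sample all valid orders with equal probability.

```python
import random

EMOTIONS = ["neutral", "sadness", "joy", "anger", "tenderness"]

def recitation_order(emotions, repeats, rng):
    """Return a random order with `repeats` of each emotion and no immediate repeats."""
    total = len(emotions) * repeats
    while True:
        remaining = {e: repeats for e in emotions}
        order = []
        while len(order) < total:
            # Emotions still available that differ from the previous recitation.
            candidates = [e for e in emotions
                          if remaining[e] > 0 and (not order or e != order[-1])]
            if not candidates:
                break  # dead end: only the previous emotion is left, so restart
            weights = [remaining[e] for e in candidates]
            choice = rng.choices(candidates, weights=weights, k=1)[0]
            order.append(choice)
            remaining[choice] -= 1
        else:
            return order

order = recitation_order(EMOTIONS, 10, random.Random(0))
```

Weighting the draw by the remaining counts keeps dead ends rare, so restarts are seldom needed.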

The speech samples were recorded over the course of three days in a radio
anechoic chamber at the University of Oulu. The subject was in the chamber
together with an instructor, who supervised the recording session. A microphone
was placed at a distance of 50 cm from the subject’s mouth. The distance was 
controlled before each recitation using a string tied to the microphone stand. 

Before recording each subject, the microphone signal was calibrated using a
calibrator (Brüel & Kjær 4231). The recording microphone was a Brüel & Kjær
4188, which has a frequency range from 8 to 12500 Hz (±2 dB). The microphone
was connected through a preamplifier (Brüel & Kjær 2138 Mediator) to a Sony
DTC-690 DAT recorder. The DAT recorder used a standard sampling rate of 
48 kHz. To prevent signal degradation, the recorded signals were digitally trans- 
ferred from DAT tapes to a computer. 

The first 40% of the recitations of each actor were chosen for further analysis. 
Signal segmentation was performed manually. Three /a:/ vowels, surrounded 
by an unvoiced plosive or fricative, were cut from predefined positions in the 
beginning, middle and at the end of each of the 180 analyzed recitations (words 
[taashan], [taakkahan], [muistakaapa]). 

The /a:/ vowels were downsampled to 22.05 kHz and 40 ms sound segments 
were cut from them. The location of the segment was selected so that the segment 
is at a stationary part of the vowel, and, if possible, at the middle of the vowel. 
Stationary segments slightly shorter than 40 ms were accepted as well. 
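The segment length implied by these figures can be checked with a short sketch. It only places a 40 ms window at the middle of a vowel; the stationarity check that guided the manual selection in the study is not modeled, and `central_segment` is a hypothetical helper name.

```python
def central_segment(n_samples, fs=22050, dur_ms=40):
    """Return (start, end) sample indices of a window centered in the vowel.

    At 22.05 kHz a 40 ms window is 882 samples long. Vowels shorter than the
    window are returned whole, mirroring the acceptance of slightly shorter
    segments.
    """
    seg_len = min(round(fs * dur_ms / 1000), n_samples)
    start = (n_samples - seg_len) // 2
    return start, start + seg_len

start, end = central_segment(3000)   # a vowel of 3000 samples (about 136 ms)
```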

2.2 Estimation of the Glottal Flow with Inverse Filtering 

The inverse filtering method used in this study was iterative adaptive inverse
filtering (IAIF) [11]. The block diagram of the IAIF method is shown in Fig.
1. The only input required for estimation of the glottal flow with IAIF is the
speech pressure waveform s(n) captured in free-field with a microphone. The
output g(n) is the estimated glottal flow. Fig. 2 shows examples of the input
and output signals. The method can be completely automated, although some
supervision and interactive optimization of the parameters often greatly
improves the quality of the inverse filtering. IAIF can be implemented using either
linear predictive coding (LPC) or discrete all-pole modeling (DAP) as an
all-pole modeling technique. In this work DAP was used, since it is able to estimate
the vocal tract filter more accurately than LPC, especially for high fundamental
frequencies [20].

IAIF works by calculating a first estimate of the glottal waveform by inverse
filtering, and then repeating the process by using the enhanced vocal tract model
acquired in the first repetition as a basis for the next repetition. For a detailed
description of IAIF, see e.g. [21].

Fig. 1. The block diagram of the IAIF method for estimation of the glottal excitation
g(n) from the speech signal s(n).
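The basic idea of inverse filtering, estimating an all-pole model and then applying its inverse to the signal, can be illustrated with a much simpler stand-in. The sketch below is a single LPC pass using the autocorrelation method; it is not IAIF, it uses LPC rather than the DAP modeling used in the study, and it ignores lip radiation. All function names and the synthetic test signal are assumptions made for illustration.

```python
def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap] of the inverse filter A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a

def inverse_filter(x, a):
    """Apply the FIR filter A(z) to x, assuming zero samples before the start."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

# Synthetic "speech": an impulse train passed through a two-pole resonator.
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
speech = []
for n, e in enumerate(excitation):
    y1 = speech[n - 1] if n >= 1 else 0.0
    y2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(e + 1.6 * y1 - 0.8 * y2)

coeffs = levinson(autocorr(speech, 2), 2)
residual = inverse_filter(speech, coeffs)   # rough estimate of the excitation
```

In IAIF this kind of step is repeated, with the glottal contribution estimated and removed before the vocal tract model is re-estimated.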

The IAIF algorithm was implemented in MATLAB. A graphical user interface
was devised for convenient inverse filtering of the signals. In the process,
the number of formants and lip radiation parameters were adjusted to obtain an 
optimal glottal flow waveform. The number of formants varied typically from 8 
to 14 (median: 11), while the lip radiation value varied from 0.97 to 1.0 (median: 
0.9925). The quality of the waveform was graded subjectively from 0 (rejected) 
to 3 (excellent waveform). 

The inverse filtering succeeded for 86% of the vowel samples; the rest
were unvoiced, highly non-stationary, clipped or otherwise unusable.
From the successfully inverse-filtered glottal waveforms, the consecutive well- 
formed glottal pulses were selected by visual inspection. 




Fig. 2. An example of a short segment of a speech pressure waveform s(n) of a female 
speaker in “tenderness” (upper pane), and the corresponding glottal flow waveform 
g(n) estimated by IAIF (lower pane). The y-axis values are arbitrary.
2.3 Parameterization of the Glottal Waveform

In inverse filtering studies, the closing quotient (ClQ) has been widely used for
parameterization of the glottal flow [12,22,23]. The ClQ is defined as the ratio
between the duration of the glottal closing phase and the fundamental period.
The ClQ is one of the most important time-domain parameters, as it is affected
by the changes of the glottal pulse during its closing phase, during which the
main excitation of the vocal tract occurs [24]. The value of ClQ reflects changes
that occur in the glottal source when, for example, vocal intensity or phonation
type is changed [12,25].

Normalized amplitude quotient (NAQ) is a new time-domain voice source
parameter closely related to the closing quotient [26]. It can be used in the same
manner as ClQ in analyzing the behavior of the glottal source in intensity or
phonation changes or when the voice is loaded [27]. It has been shown to be more
robust and to have a smaller variance than ClQ, thus providing results with a
higher level of detail. For these reasons, NAQ parameterization was used in
this study.

NAQ is calculated as follows: 

\[ \mathrm{NAQ} = \frac{f_{ac}}{d_{peak} \cdot T} \qquad (1) \]

where f_ac is the peak-to-peak flow (ac flow) of the glottal pulse, d_peak is the
absolute amplitude of the negative peak of the first derivative of the flow
waveform, and T is the period length. A visual representation of the different values
is given in Fig. 3.
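As a minimal sketch of this calculation, NAQ can be computed as the ratio of f_ac to the product of d_peak and T from the samples of one glottal cycle. The function below assumes that the input covers exactly one period and approximates the derivative with a first difference; the name `naq` and the triangular test pulse are illustrative and not taken from the study's MATLAB implementation.

```python
def naq(flow, fs):
    """Normalized amplitude quotient of one glottal cycle.

    flow: flow samples covering exactly one period (assumption)
    fs:   sampling rate in Hz
    """
    f_ac = max(flow) - min(flow)                       # peak-to-peak (ac) flow
    deriv = [(flow[i + 1] - flow[i]) * fs for i in range(len(flow) - 1)]
    d_peak = abs(min(deriv))                           # negative peak of the derivative
    period = len(flow) / fs                            # period length T in seconds
    return f_ac / (d_peak * period)

# Triangular test pulse: 60 samples opening, 20 closing, 20 closed.
n_open, n_close, n_total = 60, 20, 100
pulse = ([i / n_open for i in range(n_open)]
         + [1.0 - i / n_close for i in range(n_close)]
         + [0.0] * (n_total - n_open - n_close))
value = naq(pulse, fs=22050)
```

For this idealized triangular pulse the result reduces to the closing-phase fraction `n_close / n_total`, which illustrates why NAQ is closely related to the closing quotient.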



Fig. 3. Normalized amplitude quotient (NAQ) values calculated from an inverse filtered 
glottal flow waveform (upper pane) and its first derivative using Eq. (1) (lower pane). 
The y-axis values are arbitrary. 



The NAQ parameter calculation was implemented in MATLAB.
NAQ analysis was included in the IAIF user interface, so that NAQ values were
automatically acquired from the selected glottal pulses.
2.4 Statistical Analysis

Analysis of variance (ANOVA) was used to investigate differences among groups. 
Emotion, subject gender, vowel position within the passage and the ordinal num- 
ber of recitation were used as the factors, while the NAQ value was the depen- 
dent variable. Tukey’s honestly significant difference was then used to determine 
significant differences between the emotions. 
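To illustrate the kind of quantity an ANOVA reports, the sketch below computes the F statistic for a single factor. It is a simplification under stated assumptions: the study used several factors and their interactions, whereas `one_way_f` (a hypothetical helper) handles only one grouping, and the toy numbers are not NAQ data.

```python
def one_way_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for a list of groups."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

f_value, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value indicates that the group means differ by more than the within-group scatter would suggest; a post hoc test such as Tukey's then identifies which pairs differ.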

2.5 Auditory Discrimination Test 

No screening of the actor performances was done. However, an auditory discrim- 
ination test was constructed to verify whether judges can differentiate emotions 
from the analyzed short vowel sections. The results were also used to verify 
that the labeling of the simulated portrayals correlates with the results of the 
discrimination test. 

The test was performed by ten judges (five women and five men) between 22 and
49 years of age. All of the judges had normal hearing. They were presented
with a total of 540 vowel clips, of which 54 (10%) were duplicated so that
intra-subject reliability could be monitored, and were asked to categorize each
sample into one of the five given emotion categories. The samples were presented
in a randomized order, and each sample was repeated with an interval of
approximately 250 ms until a selection was made.

Even though the duration of the sounds was extremely short, the overall 
results of the subjective test showed that the listeners were able to discriminate 
the emotions. Intra-judge repeatability was acceptable, yielding Spearman’s rank
correlation coefficient of 0.40 for the classification of the duplicate vowel samples
(p < 0.01).
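Spearman's rank correlation, used above for the repeatability check, is the Pearson correlation of the ranks of two series. The sketch below assumes the responses are coded as numbers; how the emotion categories were coded in the study is not stated, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1              # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rho_same = spearman([1, 2, 3, 4], [10, 20, 30, 40])
rho_reversed = spearman([1, 2, 3, 4], [4, 3, 2, 1])
```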

3 Results 

Mean, standard deviation and range values for the NAQ parameter are given in
Table 1; the lowest and highest 5% of the samples were discarded as outliers.
The respective box plots are shown in Fig. 4.

The measured mean of all NAQ values was 0.108 (std.dev 0.036), with a 
range from 0.044 to 0.258. The NAQ values for females were larger than for 
males, suggesting on average a smoother glottal pulse. The standard deviation 
of NAQ for females was larger (both in absolute and relative terms) than for 
males, suggesting larger variation of voice quality for females. Although not 
shown here, the NAQ value means also had considerable variation between the 
subjects. 
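The statement about relative terms can be checked by dividing each standard deviation by its mean (the coefficient of variation). The sketch below uses the per-gender means and standard deviations reported in Table 1; the variable names are illustrative.

```python
# (mean, standard deviation) of NAQ per gender, as reported in Table 1.
naq_stats = {"males": (0.103, 0.032), "females": (0.118, 0.044)}

# Coefficient of variation: standard deviation relative to the mean.
cv = {gender: sd / mean for gender, (mean, sd) in naq_stats.items()}
```

This gives roughly 0.31 for males and 0.37 for females, so the female values vary more in relative terms as well.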

Regarding the different emotions, angriness is the only emotion showing a
smaller NAQ mean than neutral. Joy, sadness and tenderness all exhibited larger
NAQ value means than neutral, indicating that they were expressed on average
with a smoother glottal pulse than neutral. However, the considerably higher
standard deviation values for joy, sadness and tenderness than for neutral
indicate larger variation in the voice quality within these emotions.

Table 1. Mean, standard deviation and range of normalized amplitude quotient (NAQ)
in different emotions and genders. The lowest and highest 5% of the values were
discarded as outliers.

Gender    Emotion   Mean    Std.dev   Range

All       all       0.108   0.036     0.044-0.258
Males     all       0.103   0.032     0.043-0.250
Females   all       0.118   0.044     0.055-0.271
All       neu       0.097   0.024     0.057-0.166
All       sad       0.122   0.043     0.047-0.258
All       joy       0.108   0.032     0.043-0.228
All       ang       0.087   0.021     0.051-0.157
All       ten       0.132   0.041     0.069-0.271
Males     neu       0.098   0.027     0.059-0.166
Males     sad       0.106   0.035     0.047-0.177
Males     joy       0.098   0.027     0.043-0.181
Males     ang       0.088   0.021     0.051-0.137
Males     ten       0.122   0.037     0.069-0.250
Females   neu       0.097   0.021     0.057-0.154
Females   sad       0.138   0.045     0.061-0.258
Females   joy       0.121   0.035     0.058-0.228
Females   ang       0.086   0.022     0.055-0.157
Females   ten       0.147   0.047     0.079-0.271

[Figure: box plots; panel title "All subjects"; x-axis: Emotion (ang, joy, neu, sad, ten)]

Fig. 4. Box plots of normalized amplitude quotient values with regard to different
genders and emotions.

The relative order of the NAQ value means between different emotions was 
the same for both males and females. The order from smallest to largest was 
angry, neutral, joy, sadness and tenderness. Importantly, when comparing males
versus females, it can readily be noted that the variation between different emo- 
tions was larger in females than in males. Within males, neutral and joy had 
NAQ means and standard deviations very close to each other, with a mean 
value of 0.098 (0.027). The range of NAQ values for joy was larger than for 
neutral, however. The NAQ means for females did not exhibit this similarity. 
While the total NAQ mean was higher for females than for males, in angriness 
and neutral emotion the NAQ means were smaller for females than for males. 
This indicates sharper glottal pulses in angriness and neutral emotion for females 
than for males. 

Analysis of variance (ANOVA) was used to study the sources of variability
in the NAQ value data. Table 2 shows the results of the analysis of emotion,
gender and recitation number with regard to the effects on NAQ value. It can
be seen that both the emotion and subject gender have a significant effect on the
NAQ value (p < 0.001). The recitation number also has an effect on the NAQ value
(p < 0.05). Further ANOVA analysis was performed on both genders separately.
The results are shown in Tables 3 and 4. It can be noted that the recitation
number has an effect on the NAQ value for males, but not for females.



Table 2. Analysis of variance on the effect of different factors on the NAQ value. ‘Take’
is the recitation number. Vowel position in the recitation had no effect on the NAQ
value, and has been omitted from the table.

                              Df   Sum Sq   Mean Sq   F value   Pr(>F)

Emotion                        4   0.128    0.032     30.76     <2e-16  ***
Subject.gender                 1   0.024    0.024     22.99     2.2e-6  ***
Take                           1   0.007    0.007      6.36     0.01199 *
Emotion:Subject.gender         4   0.023    0.006      5.50     0.00025 ***
Emotion:Take                   4   0.004    0.001      1.01     0.40396
Subject.gender:Take            1   0.001    0.001      1.06     0.30284
Emotion:Subject.gender:Take    4   0.005    0.001      1.20     0.31154

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Table 3. Analysis of variance on the effect of different factors on the NAQ value, male
subjects. Vowel position omitted.

               Df   Sum Sq   Mean Sq   F value   Pr(>F)

Emotion         4   0.0316   0.0079    9.37      4.7e-07 ***
Take            1   0.0069   0.0069    8.25      0.0044  **
Emotion:Take    4   0.0014   0.0003    0.40      0.8073

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Table 4. Analysis of variance on the effect of different factors on the NAQ value, female
subjects. Vowel position omitted.

               Df   Sum Sq   Mean Sq   F value   Pr(>F)

Emotion         4   0.1176   0.0294    22.90     1.2e-15 ***
Take            1   0.0009   0.0009     0.72     0.40
Emotion:Take    4   0.0078   0.0020     1.52     0.20

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Figure 5 shows the results of Tukey’s honestly significant difference test.
It can be seen that for both genders combined, there are significant differences
between all emotions except neutral-angry, neutral-joy and tender-sadness. This
result is consistent with the qualitative inspection of Fig. 4. In males, only four
out of ten emotion pairs differ significantly, with the differing emotions lying at
the opposite ends in the box plot in Fig. 4. In females, seven out of ten pairs
differ significantly. This difference between the genders is also consistent with
the qualitative inspection.



[Figure: three panels (all, males, females); x-axis of each panel: differences in mean levels of Emotion]

Fig. 5. Multiple comparison of different emotions with regard to their NAQ values,
grand total and separated by genders. Bars not crossing the hashed line indicate a
significant difference between given emotions.
4 Conclusions

Samples of vocal expression of emotion in continuous speech were gathered us- 
ing simulated emotion portrayals of professional stage actors. Short /a:/ vowel 
segments of 40 ms in duration were extracted from these samples and inverse 
filtered using a sophisticated semiautomatic method. The acquired glottal flow 
was then parametrized using a robust parameter, the normalized amplitude quo- 
tient (NAQ), which describes the properties of the closing phase of the glottal 
cycle. 

Importantly, several results in this study indicated that females express
wider variation both between emotions and within them. This result is
consistent with previous findings [28]. We have no ready explanation for the
gender difference in the effect of recitation number. Apparently females were
more consistent (or random) in their expression of emotions, while the male
subjects became either more lax or more pressed in their voicing over time.

The wide variation of NAQ values in many emotions may be explained with 
a lack of precise scenario descriptions and screening of the simulated portrayals. 
For example “sadness” can be considered to be an emotion family [29], encom- 
passing widely different emotions or even moods ranging from mild depression 
to open desperation. Also in expression of joy there was great variation, rang- 
ing from mild enjoyment to full-blown elation. Angriness however, while also an 
emotion family, did not exhibit as wide a variation in NAQ values. 

Neutral emotion was nearly always expressed with a modal voice quality. The 
NAQ values of “neutral” emotion were strongly biased towards the low end of 
the NAQ value range, with only angriness having lower mean values. 

The results of the current study are promising in showing that estimation
of the glottal flow from continuous speech can be used for analysis of emotional
content of speech. We were able to apply the IAIF method to inverse filtering of
continuous speech with rather short vowel durations. The NAQ parameter could
be acquired automatically from the inverse filtered glottal flow, and a clear
connection between the NAQ values and different emotions could be shown. Most
emotion pairs also exhibited significant differences with regard to NAQ values.
With the development of fully automatic inverse filtering methods, NAQ could
become a viable candidate as a feature in automatic recognition of emotions.
Hence, the present study indicates that the application of modern inverse
filtering techniques together with the NAQ parameter may aid in overcoming the
methodological difficulties of voice quality studies applied to empirical study of
vocal expression of emotion.
Abstract. In this paper different kinds of emotional speech corpora
are compared in terms of speech acquisition (acted speech vs. elicited 
speech), utterance length and similarity to spontaneous speech. Feature 
selection is applied to find an optimal feature set and to examine the 
correlation of different kinds of features to dimensions in the emotional 
space. The influence of different feature sets is evaluated. To cope with 
environmental conditions and to get a robust application, effects related 
to energy and additive noise are analyzed. 

1 Introduction 

The first step in building an application for recognizing emotions from
speech is the selection of a training corpus. Ideally, this corpus would consist of
spontaneous emotional speech, uttered in real life by speakers who do not know
that they are being recorded.

Although there are several databases which contain spontaneous emotional 
speech (e.g., the Belfast database [1] consists of audiovisual clips of 100 speakers, 
Devillers [2] and Lee [3] use speech recorded from call center dialogues), in most 
cases this approach is not practicable, as it requires the selection of emotive 
parts from huge amounts of recorded speech. Additionally, as described in [4], it
is difficult to get access to existing corpora of spontaneous emotional speech.

As a consequence, in most cases one of two substitute strategies is followed: 

Acted speech is recorded from the speech of amateur or professional actors. 
The subjects are told to imitate the emotion by trying to imagine “as if” they 
were in a certain emotional state. Acted speech is used in [5,6,7]. 

In the case of elicited speech, emotions are induced by some method such as
recalling emotional situations, playing and retelling a movie or listening to music.
Then speech is recorded. Mozziconacci et al. [8] and McGilloway et al. [9]
use emotive texts to induce emotions. In [10] stress is induced in soldiers by
recording the response to questions in front of an evaluation board consisting of
persons with higher military rank. A comprehensive overview of the different
methods of corpus acquisition is given in [11].

Although acted speech is said to lack naturalness, it is still questionable whether
elicited speech can give a better approximation to natural expression of emotions,
as its quality depends heavily on how emotions are invoked in the subject.
The usefulness of these two different types of corpora is investigated
in this study.

A three-dimensional emotional space, which is spanned by the dimensions
activation, evaluation and power, is proposed by Osgood et al. [12]. As described
in [13], in this work the space is restricted to the dimensions of activation and
evaluation. For this space, prosodic features are said to describe the activation
dimension while quality features correlate with the dimension of evaluation.
This relationship is investigated further throughout this paper.

For the activation classifier, 36 prosodic features related to pitch, energy and
durational aspects have been used. 37 quality features for the evaluation decision
are derived from the first three formants, their bandwidth, harmonicity-based
features, energy distribution (in spectral dimension and between voiced and
unvoiced segments) and glottal flow. The features are described in more detail
in [14,13].

So far, the literature on emotion recognizers contains no description of a
system that deals with real-life conditions. To build such applications, some
robustness has to be introduced against environmental factors, such as changes
in energy caused by varying microphone distance and the impact of different kinds of
noise.

Different classifiers have been tested: linear classifiers, neural networks and
support vector machines (C-SVM with RBF kernel). Unless stated otherwise,
all results given here are obtained with SVM (for an introduction to SVM
cf. [15]). For evaluation, leave-one-out cross validation results are taken.
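Leave-one-out cross validation can be sketched as follows: each sample is held out once, the classifier is trained on the remaining samples, and the held-out sample is then classified. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM used in the paper; the function names and the toy two-dimensional features are assumptions.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (stand-in classifier)."""
    sums, counts = {}, {}
    for feats, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1

    def dist2(label):
        # Squared distance from x to the centroid of this class.
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=dist2)

def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
ys = ["passive"] * 3 + ["active"] * 3
accuracy = leave_one_out_accuracy(xs, ys)
```

Because every sample serves as test data exactly once, this scheme makes good use of small corpora, at the cost of training the classifier once per sample.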

This document starts with a comparison of two corpora, one with acted 
speech, the other one with elicited speech. Additionally the results from lis- 
tening tests are taken into account. Afterwards the influence of the length of 
the utterances on classification is studied. The assumption of different features 
correlating to the emotional dimensions is explored in section 4. Finally, envi- 
ronmental influences, such as noise and microphone distance are evaluated. 



2 Wizards or Acted Speech? 

To take a closer look at the differences between acted and elicited speech,
two corpora are compared, one recorded in a Wizard-of-Oz environment, the
other by a professional actor. To get comparable results, they are evaluated
with the same classifiers, using the same set of features.

The elicited database, AIBO, consists of four emotions (anger, happiness,
sadness, boredom) with 32 utterances each. A fifth state is made up of neutral
recordings. To account for within-speaker variance, each utterance has been
recorded twice on different days. As described in [13], classifiers are trained
for each of the two emotional dimensions, activation and valence. For the
activation classifier, angry and happy utterances are grouped into an active
class and bored and sad utterances into a passive one. 64 neutral utterances
were recorded twice to obtain balanced classifier training on the activation
axis.
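
The grouping into activation classes can be written down as a simple mapping.
The sketch below assumes that every one of the 32 utterances per emotion and
of the 64 neutral utterances was recorded exactly twice; the label list is
hypothetical and only serves to check the class balance.

```python
from collections import Counter

# Grouping of the AIBO emotion labels into activation classes,
# following the description in the text.
ACTIVATION_CLASS = {
    "angry": "active",
    "happy": "active",
    "neutral": "neutral",
    "bored": "passive",
    "sad": "passive",
}

# Hypothetical label list: 32 utterances per emotion and 64 neutral
# utterances, each recorded twice.
emotion_labels = []
for emotion in ("angry", "happy", "sad", "bored"):
    emotion_labels += [emotion] * 32 * 2
emotion_labels += ["neutral"] * 64 * 2

activation_counts = Counter(ACTIVATION_CLASS[e] for e in emotion_labels)
print(activation_counts)
```

Under these assumptions the three activation classes are equally large, which
is the balance the additional neutral recordings are meant to achieve.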

The recordings were made in a WOZ environment at Sony Stuttgart in the same
way as described in [13]. Since they were made with an application on the AIBO
entertainment robot platform in mind, they consist of short commands which
could be given to a dog. For better comparability with the acted database in
terms of cultural background, recordings were made with a native Spanish
speaker.

The acted database, Spanish Emotional Speech (SES), is the property of
Universidad Politécnica de Madrid - Departamento de Ingeniería Electrónica -
Grupo de Tecnología del Habla. It contains emotional recordings in 4+1 states
(angry, happy, sad, surprised, neutral), recorded from one male actor. The
database consists of short words (0.5-1 sec length), sentences (1.8-3 sec
length) and longer paragraphs (2-4 min length). In this paper, the short words
(438 utterances: 93 happy, 96 angry, 62 neutral, 93 surprised, 94 sad) and the
sentences (210 utterances: 45 angry, 45 happy, 45 sad, 30 surprised, 45
neutral) have been used for evaluation.

First, the differences between the two acquisition methods are examined. The
comparison between the two corpora has been done on the activation dimension.
The same prosodic features as in [13] are used. Additionally, the energy was
normalized (cf. section 5). From the SES database, 345 of the short words
(angry, happy, neutral and sad, with angry and happy grouped as active) are
taken, which are comparable in length to the utterances of the AIBO database.



Table 1. SES (acted speech), activation classification, 345 utterances. The rows of
the confusion matrix show the actual emotion and the columns the perceived state.
The last row shows the rate of correctly classified utterances. As the percentage
values are rounded, they may not always sum to 100%.

            active   neutral   passive
active        89%       8%        2%
neutral       32%      63%        5%
passive        7%       1%       91%
Total:                 85%





The classification results in tables 1 and 2 show a remarkably better
classification rate for acted speech than for elicited speech.
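
As a plausibility check, the total of table 1 can be reproduced from the
per-class rates. The sketch assumes that the total is the utterance-weighted
mean of the diagonal and that the 345 short words split into 189 active (96
angry + 93 happy), 62 neutral and 94 passive (sad) utterances, as the counts
given for the SES database suggest. The helper name is hypothetical.

```python
def overall_rate(per_class_rate, class_size):
    """Utterance-weighted mean of the per-class recognition rates."""
    total = sum(class_size.values())
    correct = sum(per_class_rate[c] * class_size[c] for c in class_size)
    return correct / total


# Diagonal of table 1 (SES, acted speech) and the assumed class sizes.
ses_rates = {"active": 0.89, "neutral": 0.63, "passive": 0.91}
ses_sizes = {"active": 96 + 93, "neutral": 62, "passive": 94}

ses_total = overall_rate(ses_rates, ses_sizes)
print(f"{ses_total:.1%}")  # about 84.9%, i.e. 85% after rounding
```

Because the per-class rates are themselves rounded, such a check can only
confirm the total to within about one percentage point.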

Table 2. AIBO (elicited speech), activation classification, 385 utterances

            active   neutral   passive
active        91%       6%        2%
neutral        9%      73%       19%
passive        4%      24%       72%
Total:                 79%

If we now assume that it is easier for our system to distinguish emotions in
sentences uttered by a professional actor, the question arises which of the two
approaches is closer to real-life affective speech.¹ Since real-life data would
be needed to answer this question, we reformulate it as: how are the two types
of speech perceived by a human audience?

Amir et al. [17] carried out listening tests on four of the emotions of the SES
database (table 3). For comparison with these perceptual tests, a separate
experiment was conducted with the 3+1 states anger, happiness, sadness and
neutrality (the states used in the listening tests) and the prosodic features.
Our results with the sentences from the SES database (cf. table 4) are in
several respects very similar to those of the listening tests (best
recognition rate for sad utterances, considerable confusion between happy and
neutral).



Table 3. Listening tests on SES database (taken from [17])

            neutral   happy    sad    angry   unknown
neutral       89%       1%      1%      4%       4%
happy         17%      75%      1%      1%       5%
sad            1%       0%     90%      1%       4%
angry          0%       1%      3%     89%       7%



Table 4. Classifying 4 emotions from the sentences of SES using prosodic features

            neutral   happy    sad    angry
neutral       80%      17%      3%      0%
happy         13%      82%      4%      0%
sad            2%       2%     96%      0%
angry          2%       4%      2%     91%
Total:                 88%





¹ It is far beyond the scope of this text to shed light on the differences
between affect, emotion and mood (cf. e.g. [16]) and their correlates in
speech. We therefore use these terms synonymously, but always refer to very
short emotional states.

For the AIBO database, listening tests on five emotional states (table 5) give
worse results than the listening tests for the SES database (table 3). The
automatic classifier shows a similar relation.



Table 5. Listening tests on AIBO database

            angry   happy    sad    bored   neutral   unknown
angry        72%     10%      0%      0%      17%        1%
happy         9%     56%      0%      0%      33%        2%
sad           0%      0%     93%      3%       3%        0%
bored         2%      0%     13%     33%      52%        0%
neutral       6%      8%      9%      3%      74%        0%
Total:                       66%









The results suggest that elicited emotions are harder to distinguish than
acted ones, for both humans and machines. While [4] states that eliciting
emotions has the advantage of giving control over the stimulus, the results
from automatic classification and listening tests illustrate the problem that
elicited speech can lead to recordings which are not emotive enough.

3 How Length Influences the Classifier

It should be easier to recognize emotions from longer utterances than from
short ones. Accordingly, the influence of the length of the utterances on the
recognizer was examined. Classification results for the sentences (1.8-3 sec)
of SES have been compared with those obtained for the short words of SES (0.5-1
sec). Here all classes from the SES database are used (anger, happiness,
surprise, sadness and the neutral class), together with the prosodic features.

Table 6. Classifying five emotions with prosodic features for sentences of 1.8-3 sec
length from SES

            happy   angry   neutral   surprised    sad
happy        76%      2%      16%        2%         4%
angry         7%     82%       4%        4%         2%
neutral      13%      0%      83%        0%         3%
surprised     4%      4%       0%       89%         2%
sad           2%      0%       2%        0%        96%
Total:                        85%







Table 7. Classifying five emotions with prosodic features for words of 0.5-1 sec length
from SES

            happy   angry   neutral   surprised    sad
happy        65%     15%      14%        2%         4%
angry        13%     76%       3%        4%         4%
neutral      16%      3%      77%        0%         3%
surprised     6%      0%       0%       92%         1%
sad           4%      4%       3%        0%        88%
Total:                        80%

This comparison of utterances of different length from the same speaker shows
a drop to 80% for short words, compared to 85% for longer utterances (tables 6
and 7). Since it is easier to emphasize emotions in longer utterances than in
short words, this outcome was expected. In addition, F0 detection becomes less
accurate for shorter utterances, which affects our feature calculation.

4 Correlation between Features and Dimensions in
Emotional Space

For the following experiments we focus on the AIBO database (elicited speech),
since it is more suitable for our purpose of testing in a real environment.

In previous studies [13,14], it is assumed that prosodic features correspond
to the arousal dimension, while quality features describe the evaluation
dimension. This assumption is now examined in more depth.

The following features have been used:

1. 36 prosodic features, related to pitch, energy and voiced/unvoiced decision 

2. 37 voice quality features, describing the first three formants, their band- 
widths, harmonic to noise ratio, energy distribution in the spectrum and 
between voiced and unvoiced parts and glottal flow. 

First, feature selection by means of scatter matrices [18] was applied to the
union of the prosodic and the quality features. Classifiers were then trained
with the resulting subsets of the whole feature set. The classification
results for the activation dimension are given in Figure 1.

We can see that there is an optimal feature set for the activation dimension
with 17 features.
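
The selection procedure can be illustrated with a small sketch. The exact
criterion of [18] is not restated here, so the code assumes one simple
scatter-based criterion, the ratio of the traces of the between-class and
within-class scatter matrices, combined with greedy forward selection. All
names and the toy data are hypothetical.

```python
def scatter_traces(samples, labels, feature_idx):
    """Traces of the between- and within-class scatter matrices,
    restricted to the features listed in feature_idx."""
    n = len(samples)
    classes = sorted(set(labels))
    between = within = 0.0
    for j in feature_idx:
        column = [x[j] for x in samples]
        grand_mean = sum(column) / n
        for c in classes:
            values = [x[j] for x, y in zip(samples, labels) if y == c]
            class_mean = sum(values) / len(values)
            between += len(values) * (class_mean - grand_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
    return between, within


def forward_selection(samples, labels, n_select):
    """Greedily add the feature that maximises tr(S_b) / tr(S_w)."""
    remaining = list(range(len(samples[0])))
    selected = []
    while remaining and len(selected) < n_select:
        def score(j):
            between, within = scatter_traces(samples, labels, selected + [j])
            return between / within if within > 0 else float("inf")
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 0 separates the classes well, feature 1 not at all,
# feature 2 only weakly.
toy_x = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.5], [0.1, 3.0, 0.5],
         [2.0, 4.0, 2.0], [2.1, 2.0, 1.0], [1.9, 3.0, 2.5]]
toy_y = ["a", "a", "a", "b", "b", "b"]
ranking = forward_selection(toy_x, toy_y, 2)
```

The trace ratio ignores correlations between features and depends on their
scale, so in practice the features would be standardised first. Training a
classifier on each prefix of the ranking then yields a curve of recognition
rate over subset size.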

Table 8 shows the ranking of the best ten features for each dimension,
according to feature selection. These results support the assumption that
different features relate to different dimensions insofar as more quality
features than prosodic features are ranked highly for evaluation, and vice
versa for activation. However, some prosodic features are also ranked highly
for the evaluation decision, and some quality features for the activation
dimension. This might be exploited in further classifiers, as most prosodic
features can be calculated more reliably than the quality features. On the
other hand, the results of feature selection, and thus the validity of
particular features, seem to depend strongly on how the speaker expresses
emotions: the SES corpus achieves almost 80% recognition rate on five emotions
with the prosodic features alone (table 7).

Fig. 1. Recognition rate for subsets of the whole feature set (prosodic + quality
features) applied to activation classification. Vertical axis is cut at 70%

Table 8. Top 10 of 71 features. Feature selection with scatter matrices and forward
selection on AIBO corpus. Voice quality features are marked bold, prosodic features
in normal text face.

Evaluation (angry/happy)                      Activation
spectral tilt after inverse filtering         energy variance
F0 at the first voiced frame                  mean square error of linear regression of F0
maximum of F0 derivative                      length of longest unvoiced region
mean bandwidth of F2                          F0 variance
band energy from 2.5-3.5 kHz                  F0 minimum
linear regression coefficient of energy curve number of voiced regions
difference between first and second harmonic  maximum HNR
  after inverse filtering
band energy from 0 kHz to F0                  mean bandwidth of F1
mean of second formant                        standard deviation of F1
standard deviation of F1                      range of F1

For comparison of different feature sets, experiments have also been carried
out with an alternative feature set, containing quartiles, inter-quartile
range, local derivative and range, and the series of minima and maxima of
pitch, intensity and intensity below and above 250 Hz, as suggested by [5]. To
eliminate environmental effects, energy was normalized as described in section
5.1. The results obtained with the AIBO corpus are very similar to the results
with the feature set from [13] (tables 2 and 9).

Table 9. Classifying activation with a different feature set on AIBO corpus

            active   neutral   passive
active        93%       6%        1%
neutral        7%      73%       20%
passive        3%      28%       69%
Total:                 79%





5 Environmental Aspects 

Recent approaches to emotion recognition from speech do not describe real-life
applications. While investigating such an application, one faces several
“real world problems” which are caused by environmental influences. A demo
system has been built, in which the feature calculation and the trained
classifiers are implemented. The system works in push-to-talk mode and is also
able to play speech files and then classify them. This mode was used to
evaluate the environmental conditions.



5.1 Intensity 

The intensity contour is an important feature for emotion recognition.
Nevertheless, the raw energy contour has the drawback that it depends on
several factors, such as the microphone distance, the properties of the
microphone and the properties of the speaker’s voice. This makes normalization
unavoidable when moving to real-life applications. As a first approach, the
simplest form of energy normalization was applied: in the prosodic feature set
derived from [13], where features are related to absolute energy, the absolute
energy is divided by the energy mean of each utterance. To save computational
cost, the calculated features are normalized rather than the energy vectors.

In the feature set from [5], the energy mean of each utterance is subtracted,
as the energy here is on a logarithmic scale.
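
Both variants of the normalization can be sketched in a few lines. For
clarity, the sketch normalizes the energy contour itself rather than the
derived features, which differs from the implementation choice described
above; the function names and frame energies are hypothetical.

```python
import math


def normalize_linear(energy):
    """Divide a linear-scale energy contour by its utterance mean."""
    mean = sum(energy) / len(energy)
    return [e / mean for e in energy]


def normalize_log(log_energy):
    """Subtract the utterance mean from a log-scale energy contour."""
    mean = sum(log_energy) / len(log_energy)
    return [e - mean for e in log_energy]


# Hypothetical frame energies of one utterance, and the same utterance
# recorded at four times the energy (e.g. with a closer microphone).
quiet = [0.5, 1.0, 1.5, 1.0]
loud = [4.0 * e for e in quiet]

quiet_log = [math.log(e) for e in quiet]
loud_log = [math.log(e) for e in loud]
```

A constant gain is a factor on the linear scale and an offset on the
logarithmic scale, so division by the mean in the first case and subtraction
of the mean in the second both remove it. After normalization the quiet and
the loud recording yield the same contour.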

To see the results of normalization, the classifier has been tested in the
demo system with different volumes. Table 10 shows recognition rates for
increasing intensity with non-normalized and normalized energy on the AIBO
corpus, with prosodic features in the activation dimension.

We can see two effects of the normalization:

1. The robustness of the classifier against volume changes increases
considerably. Table 10 shows that energy normalization is mandatory. With the
highest volume, a recognition rate about chance is obtained without
normalization. With normalization, the recognition rate decreases much less.

2. On the other hand, energy normalization in a noisy environment (cf. section
5.2) decreases the recognition rate considerably.

Table 10. Online test results with normalized vs. non-normalized energy for
increasing intensity (AIBO corpus, clean speech, three classes (activation
dimension)), linear classifier

                  I0    I0 + δ1   I0 + δ2   I0 + δ3
not normalized    73%     54%       35%       33%
normalized        71%     73%       63%       60%



5.2 Noise 

It is known from speech recognition that noise influences the behavior of
recognizers. We can distinguish between additive noise, which is produced by
sources other than the speech itself, and convolutional noise, which is
determined by the room conditions.

There are several approaches in speech recognition to make recognizers more
robust against the influence of noise (for an overview cf. [19]). The
importance of training and testing under similar conditions in speech
recognition is described in [20], where tests with matched and contaminated
data have been carried out.

To see the influence of additive noise, contaminated data was produced by
adding conference noise with a constant SNR of 20 dB to the utterances of the
AIBO database. For evaluation, the prosodic features were used to classify in
the activation dimension. To see the effects of noise alone, energy
normalization was left out at first. The recognition rate, which was about
chance when noisy speech was tested with a classifier trained on clean speech,
increased to 70.73% when training and testing on contaminated speech (table
11). These results correspond to similar results in speech recognition, as
described in [21].
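
Mixing noise into an utterance at a fixed SNR can be sketched as follows. The
sketch assumes that the SNR is defined per utterance over the mean signal
power and uses Gaussian noise as a stand-in for the conference noise
recording; the function names and signals are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def add_noise_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals
    target_snr_db, then add it sample by sample."""
    target_noise_power = mean_power(speech) / (10.0 ** (target_snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB, given the clean and the contaminated signal."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))


random.seed(0)
# One second of a 200 Hz tone at 16 kHz as a stand-in for an utterance.
speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]
noisy = add_noise_at_snr(speech, noise, 20.0)
```

Scaling the noise separately for each utterance keeps the SNR constant across
the corpus, regardless of how loud the individual recordings are.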



Table 11. Training and testing contaminated speech without normalizing energy, 36
prosodic features, activation dimension

            active   neutral   passive
active        80%      14%        6%
neutral       14%      59%       27%
passive        3%      24%       74%
Total:                 71%





Together with energy normalization, the recognition rate goes down to 61%
(table 12). The probable reason is that F0 calculation is more difficult with
added noise, which makes energy more important in this case. When the absolute
energy mean is removed, an important feature is lost. Additionally, the speaker
from the AIBO database (which was used for adding noise) seems to express
emotions mainly through energy.

Table 12. Training and testing contaminated speech with normalized energy, 36
prosodic features, activation dimension

            active   neutral   passive
active        72%      20%        9%
neutral       17%      54%       29%
passive       11%      31%       58%
Total:                 62%





6 Conclusions 

In this paper, the differences between two data acquisition methods have been
studied: emotion induction and acted speech. The results of automatic
classification and listening tests in section 2 suggest that acted speech is
easier to classify, for both humans and automatic emotion detection systems.
Several studies prefer elicited speech over acted speech. This might be driven
by practical reasons, as well as by the assumption that elicited speech is
closer to spontaneous speech in terms of emotion expression. However, since
actors are supposed to exhibit their emotions, their speech is easier to
classify. On the other hand, there is a lack of control over the induced
emotions in elicited speech, which makes it questionable whether they are more
reliable than acted speech.

Additionally, the variance between different feature sets is negligible
compared to the differences between acted and elicited speech, and also
compared to the difference between long and short utterances. Thus, for future
work it is recommended to exercise care in data acquisition. An interesting
task would be a comparison between corpora of spontaneous, elicited and acted
speech with the same feature set and classifiers.

The relevance of the emotional space described by [12] for emotion recognition
from speech is backed by our feature selection results. Additionally, we could
show that smaller subsets of mixed features from both classes (prosodic and
quality features) can give reasonable results. Subsequent research should
investigate which feature subsets are useful for speaker-independent emotion
recognition.

Energy-related features are important for the detection of some emotional
states. Therefore, an automatic emotion detector from speech should use this
information. However, the measured loudness also depends on the distance of
the speaker from the microphone. Possible workarounds are methods that localize
the speaker by using several microphones and normalize by the distance, or the
use of close-talk microphones. However, these restrict the range of possible
applications more than necessary. In contrast, energy normalization allows the
use of emotion detection systems in different environments.

Noise also influences the performance of the classifier. As seen in section
5.2, techniques from speech recognition, such as training with contaminated
data, can compensate to some extent for the influence of additive noise. Since
there is a degradation when applying energy normalization together with
contaminated data, possible future research fields include feature adaptation
to both additive and convolutional noise.


Abstract. The detection of emotion is becoming an increasingly important field
for human-computer interaction as the advantages emotion recognition offers
become more apparent and realisable. Emotion recognition can be achieved by
a number of methods, one of which is through the use of bio-sensors.
Bio-sensors possess a number of advantages over other emotion recognition
methods, as they can be made both unobtrusive and robust against a number of
environmental conditions which other forms of emotion recognition have
difficulty overcoming. In this paper, we describe a procedure to train
computers to recognise emotions using multiple signals from many different
bio-sensors. In particular, we describe the procedure we adopted to elicit
emotions and to train our system to recognise them. We also present a set of
preliminary results which indicate that our neural net classifier is able to obtain
accuracy rates of 96.6% and 89.9% for recognition of emotion arousal and
valence respectively.



1 Introduction 

The use of emotion in computers is a field which is becoming increasingly en
vogue. In many ways emotions are one of the last and least explored
frontiers of intuitive human-computer interaction. This can perhaps be explained by
the fact that computers are traditionally viewed as logical and rational tools,
something which is incompatible with the often irrational and seemingly illogical
nature of emotions [6]. It is also apparent that we as humans, while extremely good
at feeling and expressing emotions, still cannot agree on how they should best be
defined [20]. These reasons are then topped by the fact that emotion recognition is
itself a technically challenging field.

Yet there are a number of good reasons why emotion is a fitting topic for Human- 
Computer Interaction research. Humans being emotional creatures should 
theoretically be able to interact more effectively with computers which can account 
for these emotions. This can take place in at least three ways: 

First of all, computers which understand the emotions of their users would have the
potential to take appropriate courses of action which would otherwise not be
realised, for example reducing music volume in times of extreme stress, or suggesting
suitable films according to the mood of the viewer. The computer could also learn to
recognise the particular likes and dislikes of its user and anticipate appropriate actions
in an unobtrusive manner [8]. Understanding that its user likes soothing music on
returning from work, the computer could automatically ensure that such music is
playing as the user enters the house. In fact, the computer might even be better
informed in this respect than the user, who may plausibly not appreciate this fact
themselves. As such, emotion-recognising computers would become more like
personal companions than the mere tools that they are today.

E. André et al. (Eds.): ADS 2004, LNAI 3068, pp. 36-48, 2004.

Secondly, it has been demonstrated that interaction between human and computer is
largely social [7]. As emotions are a key component of human social interaction, there
is a need for computers to understand this component in order to be perceived by
users as truly effective in communication. This is underpinned by the fact that, as
Nass points out, the various components of communication are generally combined
by humans. Nass also points out that the expectations of users with regard to
the communicative abilities of computers are set to increase as interfaces appear more
sophisticated. A highly realistic avatar with stunningly realistic speech synthesis will
therefore be expected to have similarly sophisticated powers of speech recognition
and understanding. As emotion is an essential aspect of human-human
communication, it will inevitably be expected that such computers will be able to
recognise the content of spoken communication in its entirety, inclusive of the
emotional component.



Thirdly, emotions can also be used as an additional interaction modality to improve
communication effectiveness in the same way as the additional use of speech or
gestures. Just as the use of gesture can help disambiguate the meaning of a particular
utterance (e.g. “put that there”), so too can emotion be used to disambiguate the
meaning of verbal utterances, for example when sarcasm is used (e.g. “Great,
this is all I need!”).

As emotions are inherently multi-modal, there are a number of ways in which they can
be recognised. This may be via speech, facial expression, gesture and / or a variety of
other physical and physiological cues. This spread of modalities across which
emotion is expressed leaves the field open for many different potential recognition
methods. Two of the more researched methods include speech [12,14] and face
recognition [15]. In this paper we focus on an alternative and lesser-used method,
namely the use of bio-sensors. Although some research has been conducted in this
field, e.g. by Picard [6], Healy [9] or Christie [5], the areas of speech and face
recognition are far more explored.

The use of bio-sensors as a means of recognising emotion has a number of
advantages. The size of such sensors is rapidly decreasing to the extent that it should
soon be a routine matter to incorporate them into jewellery (e.g. ring or bracelet) or
clothing. Indeed this is already possible in many cases [19]. This method of
measuring bio-signals and thus emotions might be anticipated as being less disturbing
than being "watched" by a camera, as is the case with facial expression recognition.
Facial expression recognition is also difficult when light levels are low and when the
user is moving. Additional difficulties exist when recognising emotion from speech,
for example in applications where users are listening to music or watching movies.
In such cases it cannot be expected that they would talk while watching, and even if
this were the case, severe problems currently exist in recognising and isolating speech
from ambient background noise [17].

There are therefore strong arguments for developing systems which can recognise
emotions from bio-signals, and which can ultimately use a combination of methods
(including speech and expression recognition) to capitalise on the advantages of each
method. In this paper we will focus on the issues surrounding the recognition of
emotion from bio-sensors and then describe a methodology we have adopted to both
elicit emotions and train a system to recognise them in future.



2 The Bio-signals 

When we are frightened our heart races, our breathing becomes rapid, our mouth 
becomes dry, our muscles tense, our palms become sweaty, and we may want to run. 
These bodily changes are mediated by the autonomic nervous system, which controls 
heart muscle, smooth muscle, and exocrine glands [10]. The autonomic nervous 
system itself can be divided into sympathetic and parasympathetic divisions. Both 
operate in conjunction with each other and with the somatic motor system to regulate 
most types of behaviour, whether in normal or emergency situations. Although
several visceral functions are controlled predominantly by one or the other division,
and although both the sympathetic and parasympathetic divisions often exert
opposing effects on innervated target tissues, it is the balance of activity between the
two that helps maintain a stable internal environment in the face of changing
external conditions [2]. Certain emotions can affect this balance and can result in a
wide variety of bodily reactions comparable to the ones described above.

These bodily reactions can be monitored and measured. These signals are then 
referred to as bio-signals. What we can observe from the outside are the bodily 
reactions only. Our goal is to use these reactions and by means of special bio-sensors, 
deduce the emotional state of the user. 

We use the following set of bio-signals: 

• Electromyography (EMG) refers to the muscle activity or frequency of 
muscle tension of a certain muscle. This signal was chosen because high 
muscle tension often occurs under stress. The absolute level of the muscle 
tension however strongly depends on the muscle where it is measured. 

• Electrodermal activity, also referred to as skin conductivity (SC), basically
measures the conductivity of the skin, which increases if the skin is sweaty.
This signal was found to be a good and sensitive indicator of stress as well as
other stimuli and also helps to differentiate between conflict and no-conflict
situations or between anger and fear. The problem with this signal is that it is
also influenced by external factors such as outside temperature. It therefore
needs reference measurements and calibration.

• Skin temperature simply describes the temperature as measured on the
surface of the skin. Since muscles are tense under strain, the blood vessels
will be contracted and therefore the temperature will decrease. Comparable
to the SC, the skin temperature also depends on external factors. Furthermore
it is a relatively slow indicator of changes in emotional state.

• Blood volume pulse (BVP) is a measure to determine the amount of blood
currently running through the vessels, e.g. in the finger of a test subject. A
photoplethysmograph (PPG), which consists of a light source and a photo
sensor, is attached to the skin and the amount of reflected light, which
depends on the amount of blood, is measured. BVP can be used to measure
vasoconstriction and the heart rate.

• Electrocardiogram (ECG). The ECG signal is the manifestation of 
contractile activity of the heart. This can be recorded either directly on the 
surface of the chest or alternatively on the limbs. Recording ECG from the 
limbs is less inconvenient but more vulnerable to artifacts. For the initial 
system we decided to use the ECG from the chest to prevent the system 
being influenced by artifacts. It can be used to measure heart rate (HR) and 
inter-beat intervals (IBI) to determine the heart rate variability (HRV). A low 
HRV can indicate a state of relaxation, whereas an increased HRV can 
indicate a potential state of mental stress or frustration. 

• Respiration sensors measure how deep and fast a person is breathing. This
is measured by applying a rubber band around the chest. Fast and deep
breathing can indicate excitement such as anger or fear but sometimes also
joy. Rapid shallow breathing can indicate tense anticipation including panic,
fear or concentration. Slow and deep breathing indicates a relaxed resting
state, while slow and shallow breathing can indicate states of withdrawal or
passivity, such as depression or calm happiness.

We used a combination of these signals to derive a set of features that can be used to
train a neural network classifier. This classifier is then used to automatically detect
the emotional state of a user in terms of arousal and valence values, based on the
bio-signals recorded.
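
As an illustration of such features, heart rate and heart rate variability can
be derived from a series of inter-beat intervals. The sketch below assumes the
standard deviation of the intervals (often called SDNN) as the HRV measure;
this choice, the function names and the sample values are assumptions made for
illustration only.

```python
import math


def heart_rate_bpm(ibi_seconds):
    """Mean heart rate in beats per minute from inter-beat intervals."""
    return 60.0 / (sum(ibi_seconds) / len(ibi_seconds))


def hrv_sdnn_ms(ibi_seconds):
    """Sample standard deviation of the inter-beat intervals in milliseconds."""
    mean = sum(ibi_seconds) / len(ibi_seconds)
    variance = sum((x - mean) ** 2 for x in ibi_seconds) / (len(ibi_seconds) - 1)
    return 1000.0 * math.sqrt(variance)


# Hypothetical inter-beat intervals in seconds.
ibi = [0.80, 0.82, 0.78, 0.80, 0.80]
hr = heart_rate_bpm(ibi)
hrv = hrv_sdnn_ms(ibi)
```

For these sample intervals the mean interval is 0.8 s, which corresponds to a
heart rate of about 75 beats per minute.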



3 The Bio-sensors 

For measuring the different body functions described in the previous section, we used
the ProComp+ sensor set [1]. The application of the sensors is depicted in the
following figures.

Figure 1 shows the sensors for measuring EMG. We chose to measure the muscle
activity at the masseter muscle, since muscle movement has been described as
reliable in that location [9]. Figure 2 shows a standard ECG sensor. In Figure 3
the respiration sensor is shown applied to the chest, while in Figure 4 the skin
conductivity, BVP and temperature sensors can be seen applied to the fingers of the
left hand (the non-dominant hand should be used for measuring).

Fig. 3. Respiration sensor

Fig. 4. Skin conductivity, temperature and BVP sensor

We are well aware that the current form and method of application of the
bio-sensors is anything but intuitive and natural. However, considering the current
trend towards wearable computing, it can be expected that the bio-sensors will sooner
or later be tiny enough to be inconspicuously incorporated into clothing, jewellery or
the like. In this case the users, although aware that their bio-signals are being
monitored, will not be physically aware of them. However, for research purposes we
chose the rather bulky, cabled sensors, because they allow a certain flexibility, e.g. in
terms of placement of the sensors. This flexibility is important given the fact that many
aspects of sensor usage are still not completely clear, e.g. which muscle is most
appropriate for measuring the EMG signal.

Once it is clear exactly where and how the bio-signals can be optimally measured, 
however, special dedicated sensors can be used. It might also be argued that 
measuring some signals such as the ECG is always disturbing no matter how tiny the 
sensors are, because users would have to undress to apply them to their chest. An 
important component of our research, however, is to find redundancies in the set of 
bio-signals which may allow less complicated sensor arrangements to be 
developed. It may, for example, be conceivable to derive information about 
the HR from the BVP instead of from the ECG with its more inconvenient sensor 
arrangement. 



4 Data Collection Experiments 

Having established a set of signals which may be of use for recognising emotion, it is 
then necessary to define a methodology in order to enable the system to translate the 
signals coming from these sensors into specific emotions. A number of issues arise in 
such an endeavour. Of chief importance is the fact that unlike emotion recognition 
from speech or facial expression [13], comparatively little is known about which bio- 
signals accompany which emotions. Before being able to train the system to recognise 
certain signals therefore, it is first necessary to elicit emotions and then record the 
various bio-signals that accompany them. 

An initial data collection process therefore had to be designed to allow this to be 
achieved. Designing such a process gives rise to two broad issues which need to be 
considered. The first concerns the question of how to actually elicit emotions for 
signal recording purposes. The second issue concerns the definition of a procedure 
which ensures that emotions are elicited in as valid and reliable a way as possible. 

Dealing first with the issue of how to elicit emotions: a number of options are 
possible each possessing different moral and practical aspects. Procedures with the 
most validity rely on creating a strong sense of realism in the subject. While this may 
largely be acceptable for more positive emotions, the elicitation of negative emotions 
would most likely need to rely on procedures such as causing pain or humiliation 
which are ethically questionable. In addition to the ethical aspect, such procedures are 
extremely time consuming and require a large amount of care and effort both in 
planning and execution. At the other extreme, procedures which require 
comparatively little planning, such as asking subjects to act out certain emotions [9, 
11], are likely to be insufficiently real for bio-sensors. The use of bio-sensors as lie 
detectors is testament to this. 

It was therefore essential that the procedure to be adopted elicited real emotions, 
but at the same time minimised the ethical and practical disadvantages of methods 
which create the most extreme sense of reality. An ideal solution as a starting point 
proved to be the IAPS photo set [2,3]. This takes the form of over 800 photographs 
which have been classified by a large number of participants in terms of arousal and 
valence. Arousal refers to how strong the picture content is. Valence (sometimes 
known as ‘pleasure’) refers to how positive or negative the content is considered to 
be. The IAPS photo set therefore possessed a number of advantages from the outset in 
that it was validated, and classified in terms of two simple but relevant dimensions. It 
therefore represented an appropriate first step for our classifier. The photo set was also 
amenable to a laboratory setting which was necessary in order to screen out any 
confounding variables such as physical exertion, visual or audio distractions which 
might influence the sensors. 

Having decided on the elicitation materials, it was then necessary to decide upon a 
procedure which enabled the elicitation of emotions in such a way that they were 
neither biased by the original mood of the participant nor by emotions elicited early 
on in the experiment. To achieve this, we adopted a procedure whereby participants 
would rest for 15 minutes before beginning the experiment. Each participant was 
then shown a set of 10 low arousal and neutral valence photographs for 15 seconds 
each in order to try and normalise their mood. After that participants were shown a set 
of 5 low arousal positive and a set of 5 low arousal negative photos, followed by 
medium arousal positive and negative photos, followed by high arousal 
positive and negative photos. In between each set of 5 photos where either the arousal 
or valence changed, participants were shown a set of low arousal, neutral valence 
photos in order to normalise their mood once more. 
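
The presentation order described above can be written down as a small schedule generator. This is only a sketch: the names `Step` and `build_schedule` are hypothetical, and both the number of neutral photos shown between emotional sets and the 15 second duration for photos outside the initial neutral set are assumptions, since the text gives these values only for the initial neutral set.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    label: str    # what is shown in this step
    photos: int   # number of photographs (0 for the resting phase)
    seconds: int  # total duration of the step


def build_schedule(neutral_between: int = 5,
                   seconds_per_photo: int = 15) -> List[Step]:
    """Rest, neutral baseline, then emotional sets of increasing arousal,
    each separated by a neutral set (set size is an assumption)."""
    schedule = [
        Step("rest", 0, 15 * 60),
        Step("neutral baseline", 10, 10 * seconds_per_photo),
    ]
    blocks = [(arousal, valence)
              for arousal in ("low", "medium", "high")
              for valence in ("positive", "negative")]
    for index, (arousal, valence) in enumerate(blocks):
        schedule.append(
            Step(f"{arousal} arousal, {valence}", 5, 5 * seconds_per_photo))
        if index < len(blocks) - 1:  # no neutral set after the final block
            schedule.append(
                Step("neutral", neutral_between,
                     neutral_between * seconds_per_photo))
    return schedule
```

With these assumptions the high arousal, negative valence set is always the last step, which matches the ordering motivated in the text.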

We adopted a progressive increase in arousal level in order to minimise the 
disturbing effect which the content of any one photograph might have on subsequent 
photographs. We also ensured that the high arousal, negative valence photographs 
were shown last as pilot studies had shown that the content of the photographs (e.g. 
mutilations and body parts) was so disturbing that it was difficult for subjects to 
return to a normal resting state after seeing them. Participants also subsequently 
anticipated seeing a disturbing photo every time a new photograph was shown which 
affected the sensor readings. The procedure therefore represented a necessary 
compromise but proved to deliver useful data from which it was possible to derive a 
preliminary set of bio-signal features. The manner in which features were extracted 
from these signals is described in the next section. 



5 Feature Extraction 

In order to derive features from the various bio-signals, we use a common set of 
feature values which are processed and used as an additional input to, or as a substitute 
for, the raw signal in the classification. These common feature values are: 

Running mean: The running mean computes a vector of mean values over time for a 
specific input signal by using a large rectangular window that is shifted across the 
feature vector. The size of the rectangular window depends on the input signal. This 
makes it possible to distinguish between phasic, fast changes and tonic, slow moving 
components in the analysed signals. 



Running standard deviation: This procedure calculates the local standard deviation 
of the signal and results in a vector that could be described as the activity and changes 
in the current signal. 



Slope: The slope is just a simple approximation of the first derivative and therefore 
indicates fast changes in the sensory data. Filtering is often necessary before using 
this procedure as input signal noise may disturb the results. 
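
The three common feature values can be illustrated with a few lines of Python. This is a minimal sketch rather than the processing chain used in the experiments: the function names are hypothetical, the window is assumed to be centred and truncated at the signal borders, the population standard deviation is assumed, and the slope is taken as a simple backward difference.

```python
from statistics import fmean, pstdev


def _window(signal, index, size):
    """Centred rectangular window, truncated at the signal borders."""
    half = size // 2
    return signal[max(0, index - half): index + half + 1]


def running_mean(signal, size):
    """Local mean: follows the slow (tonic) component of the signal."""
    return [fmean(_window(signal, i, size)) for i in range(len(signal))]


def running_std(signal, size):
    """Local standard deviation: a measure of activity in the signal."""
    return [pstdev(_window(signal, i, size)) for i in range(len(signal))]


def slope(signal, fs=1.0):
    """Backward difference scaled by the sampling rate fs, as a simple
    approximation of the first derivative (first value set to 0)."""
    return [0.0] + [(b - a) * fs for a, b in zip(signal, signal[1:])]
```

As noted in the text, a noisy signal should be low pass filtered before `slope` is applied, since differencing amplifies high frequency noise.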

Having described these common feature values, the precise manner in which each 
of the bio-signals is treated will now be described: 



5.1 ECG 

In order to process the ECG features we first subtract the global mean value of the 
raw signal, which might be shifted upwards or downwards by the recording device. The signal 
is filtered with a low pass filter with $f_c = 90$ Hz to remove high frequency noise, a very 
sharp high pass filter with $f_c = 0.5$ Hz to eliminate low frequency drifts that occur when 
the subject is moving, and then a notch filter at $f_n = 50$ Hz that removes influences of the 
power line. We detected QRS complexes by the following algorithm (1). This is a 
derivative based method for detecting QRS complexes of the ECG signal [1] 
consisting of a weighted sum of the smoothed first derivative and the approximated 
second derivative. 

$$ y(n) = \alpha \, |x(n) - x(n-2)| + \beta \, |x(n) - 2x(n-2) + x(n-4)| \tag{1} $$

where $x(n)$ is the input signal, $\alpha$ and $\beta$ are the weight parameters for the balance 
between the smoothed three point first derivative and the approximated second 
derivative, and $y(n)$ is the output of this stage. $\alpha$ and $\beta$ are set to 1.3 and 1.1 according 
to [1]. A further smoothing of the result is obtained by the introduction of an M-point 
moving average filter (2). 

$$ y_{\mathrm{filt}}(n) = \frac{1}{M} \sum_{j=0}^{M-1} y(n-j) \tag{2} $$

The resulting signal is scanned with an adaptive threshold that is defined by (3). 

$$ \mathrm{thres}(n) = \frac{\max[n-k,\, n+k] - \min[n-k,\, n+k]}{\alpha} \tag{3} $$



The result is a list of samples each indicating the onset of a single QRS complex. 
This list is used to calculate a set of different features like heart rate (HR), heart rate 
variability (HRV) and the inter beat interval (IBI) between consecutive heart beats. 
The HRV, for example, is influenced by the sympathetic and parasympathetic vagus 
nerve and is therefore a good indicator of the temporary dominance of one of them. 
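
The detection chain (weighted derivative stage, M-point moving average and adaptive threshold) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and the window half-width `k`, the filter length `m`, the threshold `divisor` and the refractory period are assumed values that are not given in the text. Only the weights 1.3 and 1.1 come from the description above.

```python
def derivative_stage(x, alpha=1.3, beta=1.1):
    """Weighted sum of the three point first derivative and the
    approximated second derivative; the first four samples stay 0."""
    y = [0.0] * len(x)
    for n in range(4, len(x)):
        first = abs(x[n] - x[n - 2])
        second = abs(x[n] - 2 * x[n - 2] + x[n - 4])
        y[n] = alpha * first + beta * second
    return y


def moving_average(y, m):
    """Causal M-point moving average with zero history before n = 0."""
    return [sum(y[max(0, n - m + 1): n + 1]) / m for n in range(len(y))]


def detect_qrs_onsets(x, k=50, m=8, divisor=2.0, refractory=25):
    """Return sample indices where the smoothed signal first exceeds the
    local threshold (max - min) / divisor taken over [n - k, n + k]."""
    y = moving_average(derivative_stage(x), m)
    onsets = []
    previous_above = False
    for n in range(len(y)):
        window = y[max(0, n - k): n + k + 1]
        threshold = (max(window) - min(window)) / divisor
        above = y[n] > threshold
        far_enough = not onsets or n - onsets[-1] >= refractory
        if above and not previous_above and far_enough:
            onsets.append(n)  # rising edge: start of a new complex
        previous_above = above
    return onsets
```

The refractory period is an addition of this sketch; it simply prevents several threshold crossings inside one complex from being counted as separate beats.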



The HR is calculated simply by using the difference between two consecutive 
detected QRS complexes ($I_{hb}$): 

$$ HR(n) = \frac{1}{I_{hb}} \cdot f_s \tag{4} $$



However, in order to make it more stable against artefacts and irregularities of the 
heart function, we introduce a check that calculates the mean duration of the previous 
IBIs and tests whether the next heart beat occurs within a range of ±15%. 

If this constraint is violated, then we assume an out of range increase of the heart 
rate and correct the signal by introducing a beat at the expected position. This 
prevents missed QRS complexes or anomalies from influencing the HR. The HR itself 
is used as a feature as well as the previously mentioned common features derived 
from it. 
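
The plausibility check on the inter beat interval can be sketched as below. The helper names, the number of previous intervals used for the mean (`history`) and the decision to discard beats that arrive too early are assumptions of this sketch; the text only specifies the ±15% range and the insertion of a beat at the expected position. The conversion to beats per minute is likewise an assumption.

```python
def correct_beats(onsets, tolerance=0.15, history=8):
    """Insert beats at the expected position when an interval is too long
    and skip detections that arrive too early (assumed to be artefacts)."""
    beats = list(onsets[:2])
    for candidate in onsets[2:]:
        recent = beats[-(history + 1):]
        intervals = [b - a for a, b in zip(recent, recent[1:])]
        mean_ibi = sum(intervals) / len(intervals)
        # Interval too long: assume missed beats and fill the gap.
        while candidate - beats[-1] > (1 + tolerance) * mean_ibi:
            beats.append(int(round(beats[-1] + mean_ibi)))
        # Accept the detection only if it is not too early.
        if candidate - beats[-1] >= (1 - tolerance) * mean_ibi:
            beats.append(candidate)
    return beats


def heart_rate_bpm(beats, fs):
    """Heart rate per interval: fs / IBI in Hz, scaled to beats per minute."""
    return [60.0 * fs / (b - a) for a, b in zip(beats, beats[1:])]
```

For example, with a sampling rate of 100 Hz a missed beat in `[0, 100, 200, 400, 500]` is filled in at sample 300, so that the resulting heart rate stays at 60 beats per minute instead of dropping to 30 for one interval.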


5.2 BVP 



The BVP waveform is potentially very useful for extracting a set of features for 
emotion recognition. It is possible to extract the HR by a procedure that is similar to 
the extraction of the HR from the ECG signal. This is done by subtracting the global 
mean, differentiating the signal and then applying the following: 

$$ Y(n) = \left( X(n) + \min(X) \right)^{2} \tag{5} $$

$$ \mathrm{thres}(n) = \frac{\max[n-k,\, n+k] - \min[n-k,\, n+k]}{\alpha} \tag{6} $$



The signal is then scanned for samples that exceed the computed threshold. 

The process which follows is identical to that used in the ECG feature processing. 
The HR obtained in this way is the same as that derived from the ECG. However, it is not 
always possible to use the ECG, while the BVP sensor sometimes 
delivers too many artefacts. It might therefore be better to have a redundant or 
complementary system. 

Another value which can be calculated from the BVP is the vascular dilatation in 
the limbs. This feature, which is also known as the ‘pinch’ of the BVP, is the complete 
amplitude of the raw signal. It is calculated by shifting a rectangular window over the 
signal. The window is analysed according to the maximum and minimum values. The 
difference between the maximum and the minimum value vector is the resulting pinch 
vector. 

$$ P(n) = \max[n-k,\, n+k] - \min[n-k,\, n+k] \tag{7} $$
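
The pinch computation is a sliding maximum minus a sliding minimum and can be sketched in a few lines. The function name and the truncation of the window at the signal borders are assumptions of this sketch; the window half-width `k` is not specified in the text.

```python
def pinch(signal, k):
    """Local peak-to-peak amplitude over the window [n - k, n + k];
    the window is truncated at the beginning and end of the signal."""
    result = []
    for n in range(len(signal)):
        window = signal[max(0, n - k): n + k + 1]
        result.append(max(window) - min(window))
    return result
```

In practice `k` would be chosen so that the window spans at least one full pulse; otherwise the result reflects only part of the pulse amplitude.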

The shape of the BVP signal is subject to change according to the workload of the 
heart [4]. For example, the dicrotic notch can clearly be recognised in the signal of a 
resting person but is suppressed during exercise. We have therefore also implemented an 
algorithm that extracts a section of 16 detected heart beats and normalises it according to 
the min/max values. From this section we compute a histogram with 100 bins and analyse the 
ratio between the sums over the intervals [40,70] and [70,100]. 
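
The histogram ratio can be sketched as follows. The function name is hypothetical, and the treatment of the bin boundaries (bins 40 to 69 against bins 70 to 99, with the maximum value assigned to the last bin) is an assumption, since the text does not say how the shared boundary at 70 is handled.

```python
def histogram_ratio(section, bins=100, low=(40, 70), high=(70, 100)):
    """Min/max normalise a section of the BVP signal, build a histogram
    and return the ratio of the counts in the two bin ranges."""
    lowest, highest = min(section), max(section)
    span = highest - lowest
    if span == 0:
        raise ValueError("section is constant, histogram is undefined")
    counts = [0] * bins
    for value in section:
        index = min(int((value - lowest) / span * bins), bins - 1)
        counts[index] += 1
    low_sum = sum(counts[low[0]:low[1]])
    high_sum = sum(counts[high[0]:high[1]])
    return float("inf") if high_sum == 0 else low_sum / high_sum
```

A ramp such as `list(range(1000))` spreads its samples evenly over all bins and therefore gives a ratio of 1.0; a pulse shape that spends more time in the middle amplitude range than near the peak gives a ratio above 1.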



5.3 Skin Conductivity 

The skin conductivity consists of two separate components. There is a slow moving 
tonic component that indicates a general activity of the perspiratory glands from 
temperature or other influences and a faster phasic component that is influenced by 
emotions and the level of arousal. We process three features from this signal. Firstly, 
the raw signal is normalised according to (8) and filtered with a low pass filter with 
$f_c = 4$ Hz, and the common features such as slope, mean and standard deviation are computed. 

$$ X_{\mathrm{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)} \tag{8} $$

Two further features are then taken from [4]. For the first, a time varying local sample 
mean is obtained by windowing, and the resulting mean is subtracted from the original signal. 

$$ g_{\mu}(n) = g(n) - \frac{1}{N} \sum_{k=0}^{N-1} g(n-k) \tag{9} $$

The second measure is a time varying estimate of the local variance of the signal. 

$$ g_{\sigma}(n) = \frac{1}{N-1} \sum_{k=0}^{N-1} \left( g(n-k) - \frac{1}{N} \sum_{l=0}^{N-1} g(n-l) \right)^{2} \tag{10} $$
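
The min/max normalisation and the two local measures (signal minus local sample mean, and local variance over the last N samples) can be sketched as follows. The function names are hypothetical, and the choice to start the output at the first sample for which a full window of N samples is available is an assumption of this sketch.

```python
def normalise(x):
    """Min/max normalisation into the range [0, 1]."""
    lowest, highest = min(x), max(x)
    if highest == lowest:
        return [0.0] * len(x)
    return [(value - lowest) / (highest - lowest) for value in x]


def local_mean_removed(g, n_win):
    """g(n) minus the mean of the last n_win samples g(n), ..., g(n-N+1)."""
    result = []
    for n in range(n_win - 1, len(g)):
        window = g[n - n_win + 1: n + 1]
        result.append(g[n] - sum(window) / n_win)
    return result


def local_variance(g, n_win):
    """Sample variance (divisor N - 1) of the last n_win samples."""
    result = []
    for n in range(n_win - 1, len(g)):
        window = g[n - n_win + 1: n + 1]
        mean = sum(window) / n_win
        result.append(sum((v - mean) ** 2 for v in window) / (n_win - 1))
    return result
```

Subtracting the local mean removes the slow tonic drift, so the remaining signal mainly reflects the faster phasic responses.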



5.4 EMG 

The EMG signal is smoothed by a moving average filter with a length of 128 points 
and then processed using the common features in a similar manner as with the other 
signals already described. 



5.5 Respiration 

Respiration was recorded using a rubber belt around the chest. The processing was 
performed in the same way as with the basic features of the BVP. In this case 
however the amplitude ratio was left out. 



5.6 Temperature 

Temperature was recorded and the common features processed directly from the raw 
signal. 



6 Classification Experiments 

After having extracted the features as described in the previous section, we then 
trained a statistical classifier, with the goal of learning the corresponding emotion for 
a set of features with which it is presented. There are different options for building 
such a classifier. Fernandez [16], for example, used HMMs, while Healy [9] used 
Fisher linear discriminant projection. We chose to use a neural network that was 
trained and tested as described in the next section. Prior to training and testing, all 
features were normalised into the range [0,1] before being converted into 
pattern files for the neural network simulator [18]: 

$$ X_{\mathrm{norm}} = \frac{X - \min(X)}{\max(X) - \min(X)} \tag{11} $$



6.1 Neural Network Classifier 

The data for the experiment was gathered from a single subject on different days and 
different times of the day. Each pattern vector represents a frame of 2 seconds length. 
The target values that represent valence and arousal were extracted from the IAPS 
files [2,3], normalised to values in the range [0,1] and appended to the patterns that were 
recorded during the presentation of the corresponding picture. The dataset was then 
divided into three different sets for training, validation and testing. For the current 
experiments we extracted 1000 patterns with 700 patterns for training, 150 patterns 
for testing and 150 patterns for validation. 

We trained two separate networks for valence and arousal each with an input 
vector of length 13, a hidden layer with 10 neurons and a single output node that gives 
an estimation of the corresponding valence or arousal value. For a first approach we 
decided to use a limited set of input features (features set) only to get a basic feeling 
for the features and their influence and importance to the quality of the estimator. 

This feature set included the following features: heart rate (std), BVP heart rate 
(std), BVP amplitude (std), EMG amplitude, skin conductivity (std), respiration 
amplitude (std) and respiration rate (std). The logistic function was selected as the 
activation function for the hidden and output layers, and the training was done with 
resilient propagation as the learning function. 
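
A network of the described shape (13 inputs, 10 logistic hidden units, one logistic output) trained with a resilient propagation rule can be sketched in plain Python. This is not the simulator configuration used for the experiments: the class and function names are hypothetical, the variant shown is the common iRprop- form of resilient propagation, and the initial weights, step sizes and number of epochs are assumed values.

```python
import math
import random


def logistic(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    def __init__(self, n_in=13, n_hidden=10, seed=0):
        rng = random.Random(seed)
        # Each row holds the input weights followed by one bias weight.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [logistic(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        out = logistic(sum(w * h for w, h in zip(self.w2, hidden))
                       + self.w2[-1])
        return hidden, out

    def gradients(self, patterns):
        """Batch gradient of the mean squared error 0.5 * (out - target)^2."""
        g1 = [[0.0] * len(row) for row in self.w1]
        g2 = [0.0] * len(self.w2)
        loss = 0.0
        for x, target in patterns:
            hidden, out = self.forward(x)
            err = out - target
            loss += 0.5 * err * err
            d_out = err * out * (1.0 - out)
            for j, h in enumerate(hidden):
                g2[j] += d_out * h
                d_hid = d_out * self.w2[j] * h * (1.0 - h)
                for i, v in enumerate(x):
                    g1[j][i] += d_hid * v
                g1[j][-1] += d_hid
            g2[-1] += d_out
        return g1, g2, loss / len(patterns)


def rprop_train(net, patterns, epochs=80, step0=0.1, inc=1.2, dec=0.5,
                step_max=1.0, step_min=1e-6):
    """iRprop-: only the sign of each gradient is used; the per-weight step
    grows while the sign is stable and shrinks when it flips."""
    rows = net.w1 + [net.w2]  # same list objects, so updates are in place
    steps = [[step0] * len(r) for r in rows]
    prev = [[0.0] * len(r) for r in rows]
    history = []
    for _ in range(epochs):
        g1, g2, loss = net.gradients(patterns)
        history.append(loss)
        for row, grad, step, old in zip(rows, g1 + [g2], steps, prev):
            for i, g in enumerate(grad):
                if g * old[i] > 0:
                    step[i] = min(step[i] * inc, step_max)
                elif g * old[i] < 0:
                    step[i] = max(step[i] * dec, step_min)
                    g = 0.0  # skip the update right after a sign change
                if g > 0:
                    row[i] -= step[i]
                elif g < 0:
                    row[i] += step[i]
                old[i] = g
    return history
```

Following the setup in the text, one such network would be trained for arousal and a second one for valence, each on the same 13-dimensional pattern vectors with targets in [0,1].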



6.2 Results 

The independent test patterns were passed through the trained network and the results 
were analysed with the band function that counts a pattern as correct if the estimated 
and the target value lie within a range of a specified distance that we will refer to as 
bandwidth. This is shown in the table below: 



Table 1. Classification results 

             Arousal              Valence 
Bandwidth    0.1       0.2        0.1       0.2 
Correct      89.73%    96.58%     63.76%    89.93% 
Wrong        10.27%    3.42%      36.24%    10.07% 



The results show that the estimation of the valence value is a much harder task than 
the estimation of arousal. If we only allow a bandwidth of 10% for the output to be 
counted as correct, we achieve 89.7% correct classification for arousal and 63.8% for 
valence, indicating that valence is more difficult to detect from the signals. However, 
by increasing the allowed bandwidth to 20%, we can achieve 96.6% correct 
classification for arousal and 89.9% for valence, which is a highly acceptable result. 
This shows that it is possible to provide a good estimation of the mood which a 
participant is feeling (in terms of the likely section of the valence and arousal space) 
for a specific test person. Even though valence seems to be much harder to classify, 
89.9% is still a good result and echoes the fact that emotion recognition from speech 
also works relatively well for arousal but not so well for valence [12]. 
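
The band function used for the evaluation can be sketched as below. The function name is hypothetical, and counting an estimate that lies exactly on the band limit as correct is an assumption of this sketch.

```python
def band_accuracy(estimates, targets, bandwidth):
    """Fraction of patterns whose estimate lies within +/- bandwidth of
    the target value (both expressed in the normalised range [0, 1])."""
    hits = sum(1 for estimate, target in zip(estimates, targets)
               if abs(estimate - target) <= bandwidth)
    return hits / len(targets)
```

For estimates `[0.50, 0.62, 0.88, 0.10, 0.10]` and targets `[0.55, 0.50, 0.70, 0.12, 0.90]`, two of five patterns fall within a bandwidth of 0.1 and four of five within a bandwidth of 0.2, which illustrates why the rates in Table 1 rise with the wider band.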



7 Conclusion 

This paper described a method of measuring emotions in humans which is still 
relatively unexplored, namely using bio-sensors. We adopted this approach because 
we hope that it could be a supplement for other emotion-related modalities as well as 
a good alternative in situations where they are not applicable. We first described how 
we designed experiments to elicit emotions before describing how we measured and 
processed them. Finally we described our neural net classifier and the recognition 
rates for arousal and valence that we achieved on data from a specific test subject. 

There are clearly more steps to take in this research area. First of all, we aim to 
improve the accuracy of the network. This could be achieved either by adding some 
meaningful features (i.e. the full feature set) that are computed from the existing 
sensor signals or by using different bio-signals such as EEG data. There is also much 
scope to expand our system to incorporate other means of emotion recognition. 
Currently we are working on a facial expression system which can be integrated with 
bio-signal features. In addition we are working on a system to detect the eye-blinking 
frequency as this might be a good indicator of the amount of attention test subjects 
pay to their current tasks. 

Our current system is set up to recognise arousal and valence values. While this 
goes part of the way towards achieving emotion recognition, it is still some distance 
from being able to recognise distinct emotions, especially given the fact that different 
emotions may have similar or identical valence or arousal values. 

The emotion recognition results which we have described are also taken from one 
subject only. An important requirement of emotion recognising systems is that they 
do not require high levels of training for each user, as this would simply not be 
practical in the real world. It is essential therefore that subject-independent classifiers 
can be developed. This is a key goal of our research. 

As a final word, we would emphasise that while the use of bio-signals is a valuable 
means of recognising emotion, it is currently questionable as to whether it is sufficient 
on its own. In our view the combination of different sources on top of bio-sensor 
signals, such as video analysis, motion detection or emotion recognition from speech 
is a necessary step to avoid the limitations of single modality systems. 



Abstract. In this pilot study, a neural architecture for temporal emotion recognition 
from image sequences is proposed. The investigation aims at the development 
of key principles in an extendable experimental framework to study human emotions. 
Features representing temporal facial variations were extracted within a 
bounding box around the face that is segregated into regions. Within each region, 
the optical flow is tracked over time. The dense flow field in a region is subsequently 
integrated, and its principal components were estimated as a representative 
velocity of face motion. For each emotion a Fuzzy ARTMAP neural network was 
trained by incremental learning to classify the feature vectors resulting from the 
motion processing stage. Single category nodes corresponding to the expected 
feature representation code the respective emotion classes. The architecture was 
tested on the Cohn-Kanade facial expression database. 



1 Introduction 

The automated analysis of human behavior by means of computational vision techniques 
is a research topic that has gained increasing attention. Several approaches have been proposed. 
For example, Mase [1] utilized the Facial Action Coding System (FACS) to describe 
expressions based on the extracted muscle motions. Bascle et al. [2] tracked facial de- 
formations by means of face templates generated from B-spline curves. Key-frames were 
selected to represent basic face expressions. Most similar to our own approach, Essa and 
Pentland [3] extracted spatio-temporal energy of facial deformations from image se- 
quences that define dense templates of expected motions. Observed expressions of a 
human face were classified according to the most similar average motion pattern using 
a Bayesian classifier. 

Unlike previous approaches, we propose a neural network architecture that aims at a 
framework for emotion recognition based on integrated velocities (amount and direction 
of motion) in different sectors of a human face. We introduce a simple framework for fast 
incremental neural network learning to classify different emotions. The architecture is 
extendable to serve as a tool of experimental investigation. For example, the architecture 
is flexible to allow the incorporation of features that represent temporal coordination 
of emotions. In this pilot study, we utilize a supervised principle of incremental allo- 
cation of categories to represent different emotions. We evaluate the proposed network 
using a database of image sequences from facial expressions [4] and demonstrate the 
discriminative power of the network. 



E. Andre et al. (Eds.): ADS 2004, LNAI 3068, pp. 49-52, 2004. 



2 Framework for Temporal Emotion Classification 

2.1 Extracting Facial Features 

In order to pursue the analysis of facial expressions, we calculate optical flow features 
using a mechanism of local 2D motion integration proposed by Lucas and Kanade [5]. 
In order to reduce the dimension of the resulting optical flow data, the first frame in 
a sequence is labeled manually by drawing a bounding box around the face ranging 
from the top of the eyebrows down to the bottom of the chin. The bounding box is 
then subdivided into tiles of possibly varying size. In the first approach, we segregate 
the bounding box into two halves by a central (almost) vertically oriented symmetry 
line through the nose region. Horizontal separator lines that were manually adjusted to 
pass through the pupils and through the mouth further segregate the two halves. This 
finally leads to six facial regions $R_k$ (see Fig. 1, left). The second approach divides the 
bounding box into rectangles of equal size, irrespective of facial components and 
their orientation. 

In order to minimize compute time, the Lucas-Kanade algorithm is first applied on 
an equidistant grid $G_0 = \{g_0^0, g_1^0, \ldots, g_n^0\}$ in the first image pair of a sequence. This 
grid is warped by using the calculated flow vectors. The gray levels of the resulting pixel 
set $G_1 = \{g_0^1, g_1^1, \ldots, g_n^1\}$ are taken as input data for flow estimation in the next image 
pair. The difference $u_i^j = g_i^{j+1} - g_i^j$ represents the optical flow for corresponding 
grid points of the image pair $(I_j, I_{j+1})$. The optical flow in a whole sequence is finally 
described by $G = (G_0, \ldots, G_{m-2})$. 

To represent the optical flow of facial deformations, a feature vector is calculated 
in each region $R_k$. Flow estimates must be integrated over a sequence of images of 
variable length. In order to extract flow features invariant to sequence length, a vector 
of tracked motion is calculated for each grid point by $u_i = g_i^{m-1} - g_i^0$. For all vectors 
within a face region $R_k$ (or within an equal-size rectangle in the second approach) we apply 
a principal component decomposition (PCA) for data reduction. Finally, a projection onto the 
principal components leads to individual feature vectors $w_k$ and the feature set 
$F_1 = (w_1, \ldots, w_6)$ (or $F_2$ for the equi-size tessellation) (see Fig. 1, right). 
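
One possible reading of the spatial integration step is sketched below for two-dimensional motion vectors. The exact projection is not recoverable from the text, so this is an assumption: the sketch determines the first principal axis of the tracked vectors in a region from their covariance and projects the mean motion onto that axis. The function names are hypothetical.

```python
import math


def region_feature(vectors):
    """Project the mean motion of a region onto the first principal axis
    of its tracked flow vectors, given as (u, v) tuples."""
    n = len(vectors)
    mean_u = sum(u for u, _ in vectors) / n
    mean_v = sum(v for _, v in vectors) / n
    s_uu = sum((u - mean_u) ** 2 for u, _ in vectors) / n
    s_vv = sum((v - mean_v) ** 2 for _, v in vectors) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in vectors) / n
    # Orientation of the largest eigenvector of the 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * s_uv, s_uu - s_vv)
    axis = (math.cos(angle), math.sin(angle))
    length = mean_u * axis[0] + mean_v * axis[1]
    return (length * axis[0], length * axis[1])


def feature_set(regions):
    """Concatenate the feature vectors of all regions into one list."""
    features = []
    for vectors in regions:
        features.extend(region_feature(vectors))
    return features
```

With six facial regions this gives a 12-dimensional feature vector per sequence, independent of the sequence length. If all vectors in a region are identical, the covariance vanishes and the axis defaults to the horizontal direction, which is a limitation of this simple sketch.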




Fig. 1. Left: Tracked flow vectors on an equidistant grid. Right: Spatial integration 
of tracked flow vectors leading to the six feature vectors in feature set $F_1$. 



2.2 Learning Emotions with Fuzzy ARTMAP 

Feature vectors coding different emotions are classified using a Fuzzy ARTMAP neural 
network architecture [6]. Fuzzy ARTMAP is an architecture for supervised learning 
composed of two Fuzzy ART networks that are linked via a map field. In order to train 
this network, the feature vector in complement coding is presented to the first ART 
module while the desired output is presented to the second ART module. Learning is 
performed utilizing a form of hypothesis testing. When the network receives a feature 
vector, it deduces the best-matching category by evaluating a distance measure against 
all memory category nodes. Using the second input, the network either confirms or rejects 
the hypothesis; if it is rejected, the search process is repeated for a new category node. If 
the search process fails, new category nodes are dynamically allocated to encode the 
input. 
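
The search and learning cycle can be illustrated with a strongly simplified variant of Fuzzy ARTMAP in which each category node stores its class label directly, instead of using a second ART module and a map field. The class name and all parameter values are assumptions of this sketch; complement coding, the choice function, the vigilance test, match tracking and fast learning follow the standard formulation of the algorithm.

```python
def complement_code(a):
    """Append 1 - a to the input, as required by Fuzzy ART."""
    return list(a) + [1.0 - value for value in a]


def fuzzy_and(x, w):
    """Component-wise minimum of two vectors."""
    return [min(p, q) for p, q in zip(x, w)]


class SimplifiedFuzzyArtmap:
    def __init__(self, vigilance=0.0, alpha=0.001, epsilon=1e-6):
        self.vigilance = vigilance  # baseline vigilance
        self.alpha = alpha          # choice parameter
        self.epsilon = epsilon      # increment used for match tracking
        self.weights = []           # one weight vector per category node
        self.labels = []            # class label attached to each node

    def _ranked(self, coded):
        def choice(j):
            w = self.weights[j]
            return sum(fuzzy_and(coded, w)) / (self.alpha + sum(w))
        return sorted(range(len(self.weights)), key=choice, reverse=True)

    def train(self, a, label):
        coded = complement_code(a)
        rho = self.vigilance
        for j in self._ranked(coded):
            match = sum(fuzzy_and(coded, self.weights[j])) / sum(coded)
            if match < rho:
                continue                  # node fails the vigilance test
            if self.labels[j] == label:   # hypothesis confirmed
                self.weights[j] = fuzzy_and(coded, self.weights[j])
                return j                  # fast learning
            rho = match + self.epsilon    # match tracking, keep searching
        self.weights.append(coded)        # search failed: new category
        self.labels.append(label)
        return len(self.weights) - 1

    def predict(self, a):
        ranked = self._ranked(complement_code(a))
        return self.labels[ranked[0]] if ranked else None
```

Inputs are expected to be scaled to [0,1] before complement coding. Because new nodes are only allocated when no existing node both matches and predicts the correct label, the number of categories grows with the difficulty of the data rather than being fixed in advance.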

To test our framework, we used parts of the Cohn-Kanade Facial Expression Database 
(70 persons, [4]). For each person up to 10 sequences were available, each containing 
between 3 and 25 images. Each sequence represents one of the six basic emotions (surprise, 
happiness, sadness, fear, anger, and disgust). The data, however, was rather inhomogeneous in 
that it contained only a few sequences for the emotions fear and disgust. Therefore, results 
were unreliable and not stable in all cases. For each of the six emotions we trained 
one Fuzzy ARTMAP in fast learning mode [6]. To get suitable test cases for network 
performance evaluation, the leave-one-out cross-validation technique [7] was used. Also, 
a simple perceptron was trained in order to investigate linear separability of the feature 
vector. 
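
Leave-one-out cross-validation holds out each sequence once, trains on all remaining sequences and tests on the held-out one. The sketch below shows the general procedure; the function names are hypothetical, and the nearest neighbour classifier is only a stand-in used to make the example runnable, not the classifier used in this study.

```python
def leave_one_out_error(samples, labels, fit, predict):
    """Error rate when every sample is held out exactly once."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def fit_nearest_neighbour(xs, ys):
    """Stand-in classifier: simply memorise the training data."""
    return list(zip(xs, ys))


def predict_nearest_neighbour(model, x):
    """Return the label of the closest stored sample."""
    def squared_distance(item):
        return sum((p - q) ** 2 for p, q in zip(item[0], x))
    return min(model, key=squared_distance)[1]
```

With very few sequences per class, as for some emotions here, a single misclassified held-out sequence changes the class-specific error considerably, which is one reason the corresponding results are unstable.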



Table 1. Error rates for test cases with F1 (F2) 

emotion      error rate in %    No. seq.    false positive    false negative 
happiness    11.9 (11.4)        65          16 (16)           9 (8) 
sadness      13.8 (6.1)         35          9 (4)             20 (9) 
surprise     3.3 (3.3)          54          5 (6)             2 (1) 
anger        13.3 (13.8)        41          18 (19)           10 (10) 
fear         7.2 (6.6)          8           7 (6)             8 (8) 
disgust      6.1 (5.7)          7           9 (7)             4 (5) 



Table 2. F1 confusion matrix 

             happiness    sadness    surprise    anger    fear    disgust 
happiness    57           0          2           6        4       3 
sadness      3            26         4           8        2       0 
surprise     2            0          53          0        0       4 
anger        4            3          0           31       1       2 
fear         5            1          0           2        0       0 
disgust      5            0          0           2        0       3 



3 Results 

Unlike the simple perceptron, all of the six trained neural networks were able to represent 
and recall the training data set without error. This indicates that the emotions were not 
linearly separable. Table 1 shows the error rates for the test cases using both feature 
sets¹. All data was obtained with a fixed set of network parameters. Although feature 
set $F_1$ was derived from manually adjusted and labeled regions, feature set $F_2$ obtains 
similar results using data from a simple grid of fixed sample width². Table 2 shows 
the confusion matrix for the six learned emotions (achieved for the low-dimensional 
$F_1$ feature set). The results demonstrate quantitatively the rate of performance along 
with the between-class confusions. Whether the network behavior reliably reproduces 
human confusion needs further investigation of the network performance and processes 
for feature extraction. As already pointed out above, the results obtained for fear and 
disgust, respectively, were unreliable due to the limited training data available. 

¹ Since we used six emotion classes derived from the original FACS labels, instead of classifying 
the FACS labels directly (for which our approach has too low a spatial resolution), we did not 
investigate a quantitative comparison with other approaches. 

² A closer look at the results demonstrates that sadness is even better encoded in $F_2$. We conclude 
that we may need to increase the number of features in critical regions. 

4 Summary and Further Work 

In this pilot study, we presented a framework for emotion classification based on 
supervised neural network learning using Fuzzy ARTMAP. Our approach utilizes quantized 
optical flow measures that gain robustness through temporal and spatial integration. The 
feature representation that encodes the velocities of gross facial regions is built 
incrementally by allocating category nodes of the ART network. The computational architecture 
provides a testbed for further experimental investigation of the processing and analysis of 
facial emotions. 

The system performance can be further increased if more detailed motion features are 
sampled in regions of higher spatial detail, e.g., around the eyes. This could be achieved 
by automatic decomposition of regions into smaller parts if the variance of movements 
in the considered region exceeds a threshold. Researchers have argued that the temporal 
dynamics of the expression, rather than averaged spatial deformations, is important in 
expression recognition (e.g., [8]). The network can be extended by augmenting spatial 
features by time-code differentiated features. Instead of using a localized category 
representation, a distributed self-organizing feature map may increase the robustness of the 
network. It may further allow the investigation of the topographic representation of emotions 
and the study of similarities between emotions based on distance measures in the map. 



Abstract. In this paper, we report on our efforts in developing affective 
character-based interfaces, i.e. interfaces that recognize and measure 
affective information of the user and address user affect by employing em- 
bodied characters. In particular, we describe the Empathic Companion, 
an animated interface agent that accompanies the user in the setting 
of a virtual job interview. This interface application takes physiological 
data (skin conductance and electromyography) of a user in real-time, 
interprets them as emotions, and addresses the user’s affective states 
in the form of empathic feedback. We present preliminary results from 
an exploratory study that aims to evaluate the impact of the Empathic 
Companion by measuring users’ skin conductance and heart rate. 



1 Introduction 

The idea of a computer ‘sensing’ the user’s autonomic nervous system (ANS) 
activity is becoming increasingly popular in the human-computer interface com- 
munity, partly because of the availability of affordable high-specification sensing 
technologies, and also due to the recent progress in interpreting users’ physio- 
logical states as affective states or emotions [11]. The general vision is that if a 
user’s emotion could be recognized by a computer, human-computer interaction 
would become more natural, enjoyable, and productive. The computer could of- 
fer help and assistance to a confused user or try to cheer up a frustrated user, 
and hence react in ways that are more appropriate than simply ignoring the 
user’s affective state as is the case with most current interfaces. 

Our particular interest concerns interfaces that employ embodied agents, or 
life-like characters, as interaction partners of the user. By emulating multi-modal 
human-human communication and displaying social cues including (synthetic) 
speech, communicative gestures, and the expression of emotion, those characters 
may also trigger social reactions in users, and thus implement the “computers 
as social actors” metaphor [14,2]. This type of ‘social interface’ has been demon- 
strated to enrich human-computer interaction in a wide variety of applications, 
including interactive presentations, training, and sales (see [12] for a recent over- 
view). 

In this paper, we propose an interface that obtains information about a user’s
physiological activity in real-time and provides affective feedback by means of 
a life-like character. The interface is intended to address the user’s emotion by 
showing concern about user affect, sometimes called empathic (or sympathetic) 
behavior. Empathic interfaces may improve human-computer interaction or, put 
differently, leave users less frustrated in the case of a stressful event related to the 
interaction [5]. Potential application fields include software (assuming unavoidable
software-related failures), computer-based customer support, and educational
and tele-home health care systems. The web-based (virtual) job interview
scenario described in this paper serves as a simple demonstrator application that 
allows us to discuss the technical issues involved in real-time emotion recognition 
as well as the implementation of an empathic agent. 

This paper also aims to illustrate two approaches to using human physiology 
to evaluate empathic embodied interfaces. Following [17], we distinguish between 
(i) the paradigm that measures short-time (<5 seconds) physiological changes in 
response to specific events, and (ii) the paradigm that performs comparisons of 
ANS readings across longer (>5 minutes) periods of time under different control 
conditions. While the first mentioned approach is applicable to situations where 
experimental conditions can be tightly controlled (see Sect. 3), interactions for 
which tight control conditions are not possible can be evaluated by following the 
latter approach (see Sect. 5). 
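
The difference between the two paradigms can be illustrated with a minimal Python sketch. The function names, the synthetic signal, and the use of a simple difference of means are illustrative assumptions; they are not the analysis procedure of [17] or of our studies.

```python
from statistics import mean

def event_response(signal, rate_hz, event_s, window_s=5.0):
    """Paradigm (i): change in the mean signal in a short window after an
    event, relative to a window of equal length just before the event."""
    n = int(window_s * rate_hz)          # samples per window
    i = int(event_s * rate_hz)           # index of the event
    before = signal[max(0, i - n):i]
    after = signal[i:i + n]
    if not before or not after:
        raise ValueError("event too close to the edge of the recording")
    return mean(after) - mean(before)

def period_mean(signal, rate_hz, start_s, end_s):
    """Paradigm (ii): mean over a long period, to be compared across
    experimental conditions rather than across single events."""
    segment = signal[int(start_s * rate_hz):int(end_s * rate_hz)]
    if not segment:
        raise ValueError("empty period")
    return mean(segment)

# Synthetic skin conductance trace (invented values), 2 samples/second,
# with a step from 5.0 to 6.0 at t = 10 s.
sc = [5.0] * 20 + [6.0] * 20
short_term_change = event_response(sc, rate_hz=2, event_s=10)
long_term_level = period_mean(sc, rate_hz=2, start_s=0, end_s=20)
```

The first function needs a precisely timed event, which is why it suits tightly controlled settings; the second only needs the start and end of a condition.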

The rest of this paper is organized as follows. In Sect. 2, we describe work 
related to our own. Section 3 reports on the result of our previous study showing 
that empathic embodied feedback may reduce (deliberately frustrated) users’ 
level of arousal. Section 4 is dedicated to introducing the Empathic Companion. 
There, we first describe our system for real-time emotion recognition, and then 
explain how physiological signals are mapped to named emotions. The final part 
of Sect. 4 discusses the decision-theoretic agent that is responsible for selecting 
the Empathic Companion’s actions. In Sect. 5, we illustrate the structure of an
interaction with the Empathic Companion in the setting of a virtual job interview,
and discuss preliminary results of an experiment that recorded users’
physiological activity during the interaction. Section 6 concludes the paper. 



2 Related Work 

A review of the literature suggests that it is possible to distinguish at least five 
modes of usage of a user’s physiology for (affective) interfaces: (i) A user’s physio- 
logical data can be used to track the impact of the interface on the affective state 
of the user. As shown in [17], recording users’ physiological data and associating 
them with interface events is an important methodology for testing software, 
e.g. to measure the effect of different types of web page design. (ii) A user’s ANS
activity can be used in order to reflect (or ‘mirror’) the user’s affective state 
by means of an embodied agent. In this way, the user may gain insight into his 
or her physiological responses. This type of application has been shown to bear 
considerable relevance for tele-home care systems [8]. (iii) The user’s physiological
state can play a key role in selecting strategies to adapt the interface. When
the user’s frustration is detected, an interface agent can try to undo the user’s
negative feeling. A main application field of adaptive interfaces is tutoring systems
that aim at tailoring their behavior in accord with the student’s affective
state and learning goal [3]. (iv) User bio-signal data can be used to address the
user’s affective state. Major work has been done by [2], who proposes the term
‘relational agents’ for animated agents that are designed to develop
and maintain long-term, social-emotional relationships with users. Specifically,
he describes an agent that addresses human affect in the role of a health behavior
change assistant. The Empathic Companion application illustrated in this
paper also falls under this category. (v) A user’s physiological responses may
become increasingly important to learn the user’s situation-dependent affective
responses and hence allow for the acquisition of predictive user models [1]. Learning
of emotional behavior is also crucial for the previously described relational
agents that are intended to enable fertile interactions with human users over
extended periods of time [2].

The investigation of [5] is most closely related to our work on empathic 
interfaces. They describe the design and evaluation of an interface implementing 
strategies aimed at reducing negative affect, such as active listening, empathy, 
sympathy, and venting. The resulting affect-support agent used in a simulated 
network game scenario could be shown to undo some of the users’ negative 
feelings after they have been deliberately frustrated by simulated network delays 
inserted into the course of the game. The Empathic Companion interface differs
from the one used by Klein in two aspects. First, the user is given feedback in a 
more timely fashion, i.e. shortly after the emotion actually occurs, and not after 
the interaction session, in response to the subject’s questionnaire entries. While 
providing immediate response to user affect is certainly preferable in terms of 
natural interaction, it assumes that affect is processed in real-time. Hence, in 
order to assess a user’s emotional state online, we implemented a system that
takes physiological signals of the user during the interaction with the computer. 
Second, affective feedback to the user is communicated by means of a life-like 
character, rather than a text message. Although the study of Klein supports 
the argument that embodiment is not necessary to achieve social response, it 
has been shown that embodied characters may boost the tendency of people to 
interact with computers in a social way, the so-called ‘persona effect’ [16]. 



3 Empathizing with a Frustrated User 

In this section, we summarize the main result of a previously conducted
study that had the purpose of evaluating the effect of empathic embodied feed- 
back on deliberately frustrated users [13]. The impact of a life-like character’s 
empathic response was measured by comparing the skin conductance readings of 
subjects that received empathic feedback with the skin conductance of subjects 
that did not. A simple mathematical quiz game was implemented where subjects
are instructed to sum up five consecutively displayed numbers and are then asked
to subtract the i-th number of the sequence (i < 4). The instruction is given
by the “Shima” character, an animated cartoon-style 2D agent, using synthetic
speech and appropriate gestures (see Fig. 1). Subjects compete for the best score
in terms of correct answers and time. Subjects were told that they would interact
with a prototype interface that may still contain some bugs. This warning
was essential since in some quiz questions, a delay was inserted before showing
the 5th number. The delay was assumed to induce frustration as the subjects’
goals of giving the correct answer and achieving a fast score are thwarted. In the
version of the game using the empathic agent, an apology as depicted in Fig. 1
was shown to subjects, while in the other (non-affective) version the occurrence
of the delay was ignored by the animated instructor. The main result of this
experiment can be summarized as follows: If an embodied character shows empathy
to a deliberately frustrated user, then the user’s skin conductance is significantly
lower than when the character does not display empathy, as compared to the
period of induced frustration (the delay period). If the level of skin conductance is
interpreted as the user’s level of stress or frustration, then this result indicates
that empathic feedback may undo some of the user’s negative emotions. Since
the parameters in the study were tightly controlled, it was possible to apply the
first type of evaluation paradigm described in [17], namely, the identification of
short-term ANS changes in response to specific interface events.

Fig. 1. Shima character: “I apologize that there was a delay in posing the question.”



4 Addressing Users’ Affective State 

The Empathic Companion is a life-like character that was developed in the con- 
text of a web-based job interview scenario, where it addresses the user’s emotion 
resulting from an interview situation (see Fig. 2). Being interviewed is likely to 
elicit emotions in the user, especially when the interviewer (Fig. 2, left) asks 
potentially unpleasant or probing questions, such as “What was your final grade 
at university?” or “Are you willing to work unpaid overtime?”, and comments 
pejoratively upon the interviewee’s (i.e. the user’s) unsatisfactory answer. In or- 
der to emphasize the training aspect of the interview situation, the user is led by 
a companion agent (Fig. 2, right) that addresses the user’s (negative) emotions 
by giving empathic feedback, e.g. “It seems you did not like this question so
much” or “Maybe you felt a bit bad to be asked this kind of question”. The user
is told that the companion is invisible to the interviewer and present for his or
her comfort only. Although a web-based (virtual) interview cannot induce the
stress level of a face-to-face or phone interview, it provides a convenient training
platform for job seekers.

Fig. 2. Job Interview Scenario. [Screenshot: the companion says “It seems you did
not like this question so much.” Answer options: “I have already been working in
this position for several years.” and “I graduated from university recently and I
do not have any professional experience.”]

4.1 System Architecture for Real-Time Emotion Recognition 

Since the Empathic Companion application assumes real-time emotion recogni- 
tion, the system architecture depicted in Fig. 3 has been implemented on the 
Windows XP platform. Below, we will explain each of its components. 

Data Capturing. The user is attached to sensors of the ProComp+ unit from
[15]. The ProComp+ encoder allows the use of input from up to eight sensors
simultaneously. Currently, we only use galvanic skin response (GSR) and
electromyography (EMG) sensors. Sensor input is digitally sampled by the ProComp+
unit and transmitted to the computer via a fiber-optic cable using the RS232
COM port. Although the ProComp+ unit enables data sampling up to 256
samples/second, GSR and EMG signals allow for a much lower rate, at 20
samples/second. Data capturing is achieved by a module written in Visual C++
that employs the ProComp+ data capture library.

Data Processing. When prompted by the application (i.e. interface events),
the Data Processing component retrieves new data every 50 milliseconds, stores
and evaluates them. Given the baseline information for skin conductance (GSR
signal) and muscle activity (EMG signal), changes in ANS activity are computed
by comparing the current mean signal values to the baseline value. The baseline
is obtained during a relaxation period preceding the interaction. The current
mean value is derived from a segment of five seconds, the average duration of
an emotion [7]. If skin conductance is 15-30% above the baseline, it is assumed
to be “high”, and if it is more than 30% above the baseline, “very high”. If muscle
activity is more than three times higher than the baseline average, it is assumed
to be “high”, else “normal”.
Emotions are hypothesized from signals using a Bayesian network (as part of
the decision network discussed below), but optionally, a “Signal to Emotion
Mapping” module is available, if no decisions are required on the interface side.^

Fig. 3. System architecture.

The connection between the Data Processing component and the User Interface
is established by the Active Template Library (ATL) which requires
functions including Init (initializes the ProComp+ encoder), Start (initializes
data retrieval), Finish (de-allocates memory), GetBatteryLevel (retrieves current
battery level), and DataTransmission (assigns data to variables).
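
The thresholding performed by the Data Processing component can be sketched in Python as follows. This is a minimal illustration and not the Visual C++ module itself: the function names are hypothetical, and the label "normal" for skin conductance less than 15% above the baseline is an assumption, since only the “high” and “very high” cases are named above.

```python
from statistics import mean

SAMPLE_RATE_HZ = 20   # sampling rate used for GSR and EMG
WINDOW_S = 5          # length of the segment for the current mean

def window_mean(samples, rate_hz=SAMPLE_RATE_HZ, window_s=WINDOW_S):
    """Mean of the most recent window_s seconds of samples."""
    recent = samples[-(rate_hz * window_s):]
    if not recent:
        raise ValueError("no samples available")
    return mean(recent)

def classify_gsr(current_mean, baseline):
    """Label skin conductance relative to the relaxation baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (current_mean - baseline) / baseline
    if change > 0.30:
        return "very high"
    if change >= 0.15:
        return "high"
    return "normal"   # assumption: this case is not named in the text

def classify_emg(current_mean, baseline):
    """Label muscle activity: more than three times the baseline is high."""
    return "high" if current_mean > 3 * baseline else "normal"

# Invented readings: 10 s at the baseline level, then 5 s at an elevated level.
gsr_samples = [1.0] * 200 + [1.4] * 100
gsr_level = classify_gsr(window_mean(gsr_samples), baseline=1.0)
```

Taking the mean over the last five seconds before comparing with the baseline smooths out single-sample noise at the cost of a short delay in detecting a change.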

User Interface. The User Interface component contains the job interview scena- 
rio and runs under Internet Explorer 5.5 (or higher). It is written in HTML and 
JavaScript and utilizes the Microsoft Agent package [9] to control the verbal 
and non-verbal behavior of characters. This package includes an animation en- 
gine to trigger about 50 pre-defined 2D animation sequences and a text-to-speech 
engine. 

Decision-theoretic Agent. A decision network is used to combine bio-signals 
and other facts about the interaction, and relate them to emotions as well as 
agent decisions. 



^ This module is used in another application, the Emotion Mirror, where the user’s 
emotions are reflected back to the user. The mirror metaphor is realized by an 
animated interface agent that displays the user’s presumed emotion both verbally 
and non-verbally. The aim of this application is to facilitate the training of emotion 
management and regulation [13]. 

The decision-theoretic agent will be discussed in Sect. 4.3. Before that, we
will explain the interpretation of the user’s physiological activity as emotions. 

4.2 Relating Physiological Signals to Emotions 

[6] claims that all emotions can be characterized in terms of judged valence
(pleasant or unpleasant) and arousal (calm or aroused). Figure 4 shows some
named emotions as coordinates in the arousal-valence space.

The relation between physiological signals and arousal/valence is established in
psychophysiology, which argues that the activation of the ANS changes while
emotions are elicited [7]. The following two signals have been chosen for their
high reliability (other signals are discussed, e.g. in [11]).

Galvanic skin response (GSR) is an indicator of skin conductance (SC), and
increases linearly with a person’s level of overall arousal [6].

Electromyography (EMG) measures muscle activity and has been shown to
correlate with negatively valenced emotions [6].

4.3 Decision-Theoretic Agent 

The decision-theoretic agent is responsible for deriving the user’s emotion given 
physiological data and the valence of the user’s answer (to the question of the 
interviewer), and for suggesting an appropriate action. The agent is implemented
with Netica from [10], a software package that allows solving decision problems 
and provides convenient tools, including an API in Java, which has been used 
to implement the agent. 

The decision network depicted in Fig. 5 represents a simple decision problem.
A decision-theoretic agent selects actions that maximize the outcome in terms
of some utility function [4]. The subnet consisting only of chance nodes is the
Bayesian network used to derive the user’s emotional state. It relates physiological
signals (GSR, EMG) and the user’s answer to arousal and valence which are
employed to infer the user’s emotional state by applying the model of [6]. The
probabilities have been set in accord with the literature (whereby the concrete
numbers are made up). “Relaxed (happiness)” is defined by the absence of
autonomic signals, i.e. no arousal (relative to the baseline), and positive valence.
“Joyful” is defined by increased arousal and positive valence, whereas “Frustrated”
is defined by increased arousal and negative valence. The node “Answer” in
the network represents situations where the user gives a ‘positive answer’ (that
satisfies the interviewer’s question) or a ‘negative answer’ (that does not satisfy
the interviewer’s question). This (‘non-physiological’) node was included in the
network in order to more easily hypothesize the user’s positive or negative
appraisal of the question, as the user’s EMG value changes (in this application)
are often too small to evaluate valence.

Fig. 5. Simple decision network.
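
The way evidence about signals and the user’s answer could be combined into a distribution over emotions can be sketched in Python. This is a strongly simplified stand-in for the Bayesian network, not the Netica model: all probabilities are invented, arousal and valence are assumed to be independent given the evidence, and the label "other" for low arousal with negative valence is an assumption, since that combination is not named above.

```python
# P(aroused | GSR level); invented numbers for illustration only.
P_AROUSED = {"normal": 0.1, "high": 0.7, "very high": 0.9}

# P(negative valence | EMG level, answer valence); invented numbers.
P_NEGATIVE = {
    ("normal", "positive"): 0.1,
    ("normal", "negative"): 0.7,
    ("high", "positive"): 0.5,
    ("high", "negative"): 0.9,
}

def emotion_distribution(gsr, emg, answer):
    """Return P(emotion | evidence) for the emotions named in the text."""
    a = P_AROUSED[gsr]
    n = P_NEGATIVE[(emg, answer)]
    return {
        "relaxed": (1 - a) * (1 - n),   # no arousal, positive valence
        "joyful": a * (1 - n),          # increased arousal, positive valence
        "frustrated": a * n,            # increased arousal, negative valence
        "other": (1 - a) * n,           # assumption: not named in the text
    }

dist = emotion_distribution("very high", "normal", "negative")
most_likely = max(dist, key=dist.get)
```

Note how the answer node carries the valence estimate in this example: the EMG level is "normal", yet the negative answer shifts the distribution toward "frustrated".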

Besides nodes representing probabilistic events in the world (chance nodes), 
decision networks contain nodes representing agent choices (decision nodes), and 
the agent’s utility function (utility or value node). The decision node in Fig. 5 
lists some possible actions. If the advisor type is supportive, the utility function 
is set to give priority to empathic responses. 

— Show Empathy: The agent displays concern for a user who is aroused and 
has a negatively valenced emotion, e.g. by saying “I am sorry that you seem 
to feel a bit bad about that question”.

— Encourage: If the user is not aroused, the agent gives some friendly comment, 
e.g. by saying “You appear calm and don’t have to worry. Keep going!”.

— Ignore: The agent does not address the user’s emotion, and simply refers to 
the interview progress, by saying, e.g. “Let us go on to the next question”. 

— Congratulate: If the agent detects the user is aroused in a positive way, it 
applauds the user (“Well done!”, “Good job! You said the right thing”, etc.). 

“Advisor Type” is a deterministic (rather than chance) node that allows us 
to characterize the agent as supportive or non-supportive. If set to “Not Sup- 
portive”, the “Ignore” action is selected for all inputs. This node is needed to 
compare empathic vs. non-empathic versions of the companion. 
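
Action selection by maximizing expected utility can be sketched in Python as follows. The utility values are invented for illustration and are not those of the implemented decision network; only the four action names and the behavior of the “Not Supportive” advisor type are taken from the description above.

```python
ACTIONS = ["show empathy", "encourage", "ignore", "congratulate"]

# Hypothetical utilities U(action, emotion) for a supportive advisor.
UTILITY = {
    "show empathy": {"frustrated": 1.0, "joyful": 0.0, "relaxed": 0.0},
    "encourage":    {"frustrated": 0.2, "joyful": 0.3, "relaxed": 1.0},
    "ignore":       {"frustrated": 0.1, "joyful": 0.1, "relaxed": 0.1},
    "congratulate": {"frustrated": 0.0, "joyful": 1.0, "relaxed": 0.3},
}

def select_action(emotion_probs, advisor_type="supportive"):
    """Pick the action with the highest expected utility.

    emotion_probs maps emotion names to probabilities. A non-supportive
    advisor always ignores the user's emotion.
    """
    if advisor_type == "not supportive":
        return "ignore"

    def expected_utility(action):
        return sum(p * UTILITY[action].get(emotion, 0.0)
                   for emotion, p in emotion_probs.items())

    return max(ACTIONS, key=expected_utility)

action = select_action({"frustrated": 0.7, "joyful": 0.2, "relaxed": 0.1})
```

Because the advisor type is checked before any utilities are computed, the empathic and non-empathic versions share the same emotion inference and differ only in the final decision.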



5 Interacting with the Empathic Companion

In an interaction session with the Empathic Companion, the user is seated in
front of a computer running the job interview, with the GSR sensors attached
to two fingers of the non-dominant hand, and the EMG sensors attached to the
forearm of the same body side. The baseline for subsequent bio-signal changes is
obtained as the mean of the GSR and EMG values recorded during an initial
relaxation period of one minute, in which the user listens to music from Cafe del
Mar (Vol. 9).



5.1 The Structure of the Interview 

An interview session is composed of (interview) episodes, whereby each episode 
consists of four segments. 

— Segment 1: The interviewer agent asks a question, e.g. “Tell me about your
previous work experience”.

— Segment 2: The user chooses an answer from the set of given options (see 
Fig. 2, lower part), by clicking on the button next to the selected answer, 
e.g. the user admits the lack of experience by clicking the lower button. 

— Segment 3: The interviewer responds to the user’s answer, e.g. “Then you 
are not the kind of person we are looking for” or “I am happy to hear that 
you have extensive experience in the field”.

— Segment 4: The companion agent responds to the emotion derived from the
data gathered during the third segment and the user’s answer given in the 
second segment. 

The entire interview session contains ten episodes, and concludes with the in- 
terviewer agent’s acceptance or rejection of the user as a new employee of the 
company, depending on how many ‘credits’ the user could collect. 

5.2 Exploratory Study 

While a questionnaire could certainly be used to evaluate the impact of the
Empathic Companion agent, we are using physiological data to assess the user’s
perception of the interface. Since the ProComp+ unit cannot be simultaneously
employed for real-time data assessment and monitoring, the second author of
this paper has designed a signal processor that reads users’ skin conductance
(SC) and heart rate (HR). Like EMG, heart rate also correlates with negatively
valenced emotions. Since SC and HR are slowly changing signals, it was sufficient 
to set the signal processor to 2 samples/second. 

Observe that unlike the experiment reported in [13], tight experimental controls
are not practicable in the job interview application as the interaction is not
designed to invoke specific emotions at specific moments. In particular, depending
on their answers to the interviewer’s questions, users may receive positive or
negative feedback. Facing a comparable situation (users’ physiological responses
to different web page designs), [17] thus propose to compare signal values for
whole interaction periods rather than for specific interface events. Following this
paradigm, we hypothesize that, averaged over the entire interaction period, the
presence of a (supportive) Empathic Companion will leave users with lower levels
of arousal and less negatively valenced affective states. As the control condition,
the “Not Supportive” advisor type is used (see Fig. 5), where the “Ignore” action
is always selected.

Fig. 6. Sample SC data of subject interacting with the Empathic Companion.

Fig. 7. Sample HR data of subject interacting with the Empathic Companion.

We conducted an exploratory study on the overall effect of the presence of
the Empathic Companion. The subjects are connected both to the GSR sensors
of the ProComp+ unit with the first two fingers of their non-dominant hand,^
and to the second author’s unit that provides a wristband for SC (using the
dominant hand) and an ear-clip to measure HR. The SC and HR data for one
subject performing the interview with the (supportive) Empathic Companion
agent are shown in Fig. 6 and Fig. 7, respectively. In the study, participants
were 10 staff and students from the University of Tokyo, aged 23-40, who were
randomly assigned to the “Supportive” and “Not Supportive” version of the
Empathic Companion application, abbreviated as Em and NEm, respectively
(5 subjects in each version). In the following, “AM_relax” refers to the (arithmetic)
mean of the signal values obtained during the initial relaxation period, whereas
“AM_int” refers to the mean of the interview period. “Range” refers to the range
of data points of the whole interaction period.

First we compared AM_relax and AM_int without distinguishing between the
two versions. For HR, AM_relax = 77.83 and AM_int = 82.05, and for SC,
AM_relax = 6 and AM_int = 7.81. However, none of those results were statistically
significant. A possible reason might be that a significant signal change only occurs
after the interview starts, and the difference between the relaxation and interview
periods gradually diminishes during the rest of the interview. While it is not
representative, the pattern of SC data depicted in Fig. 6 indicates this tendency.



^ For simplicity, the EMG sensors have not been used.

Then we compared the Em and NEm versions by applying the equation
(AM_int - AM_relax)/Range. The intuition here is that a smaller value indicates
that the character has a more calming effect on the user (SC) or decreases
negative feelings (HR) to a higher extent. Our observations, however, do not
show this. In the case of SC, the mean value of Em is 0.11, and 0.08 for NEm
(contrary to our expectation). For HR, the mean value of Em is 0.04, and that
of NEm is 0.06. Hence there is no significant positive effect of the supportive
companion. If this result continues to hold for a larger number of subjects, we may
say that empathic character behavior does not have an impact on the affective
state of users. However, it is worth noting that the current application (the
job interview) has the empathic character interact with the user only in a very
limited way. Most of the time, the user interacts with the interviewer, whose
behavior is the same in both versions. Therefore, we are considering designing
another experiment that allows a more direct comparison between the Em and NEm
versions. For instance, the character could perform as a virtual medical health
assistant that asks the user about his or her health-related lifestyle, and then
comments on the user’s answer in a neutral or empathic fashion.
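
The normalized measure used for the comparison of the two versions can be written as a short Python function. This is a sketch under stated assumptions: the function name and the sample values are invented, and “Range” is taken here over the relaxation and interview samples together, which is one possible reading of “the whole interaction period”.

```python
from statistics import mean

def normalized_change(relax, interview):
    """Compute (AM_int - AM_relax) / Range for one subject and one signal.

    relax and interview are sequences of signal values from the relaxation
    period and the interview period. Range is taken over both periods
    together (an assumption, see the lead-in).
    """
    am_relax = mean(relax)
    am_int = mean(interview)
    whole = list(relax) + list(interview)
    value_range = max(whole) - min(whole)
    if value_range == 0:
        raise ValueError("signal is constant; range is zero")
    return (am_int - am_relax) / value_range

# Invented skin conductance values for a single subject.
subject_value = normalized_change(relax=[6.0, 6.0], interview=[7.0, 8.0, 9.0])
```

Dividing by the range makes values comparable across subjects whose signals differ in absolute level; the group value for Em or NEm is then the mean of the per-subject values.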

Certainly, other reasons could be responsible for the lack of significance of
our results. (i) The responses intended to have a calming effect on the user might
actually not do so. (ii) Increased arousal might also be interpreted as positive
arousal, especially for users performing well in the job interview. (iii) Heart rate
might not be a reliable indicator of negative valence for all users. These and
other issues will be addressed in our future work.

6 Conclusions 

This paper describes the Empathic Companion, a character-based interface that 
takes physiological signals of the user in real-time and addresses user emotions 
derived from those signals. A virtual job interview serves as an exploratory appli- 
cation scenario. While results of statistical significance could not be obtained in 
the current setting, previous findings indicate the frustration-reducing effect of 
empathic feedback [5,13]. Thus we expect that the Empathic Companion will be 
beneficial in applications where negative feelings are involved or likely to occur, 
such as online customer support or computer-based education. A particularly 
interesting application field is tele-home health care where the Empathic Com- 
panion may play a crucial role in addressing patients’ affective and social needs 
[8]. In the near future, we plan to implement a more straightforward use of the
Empathic Companion, i.e. in the role of the main and only interaction partner 
of the user rather than a secondary interlocutor as was the case in the interview 
scenario. We will also consider using text-based (empathic) feedback in order
to compare “persona” vs. “no persona” conditions. 



Abstract. The paper presents the cognitive-model-based approach of
abductive interpretation of emotions that is used in the multi-modal
dialog system SmartKom^. The approach is based on the OCC model
of emotions, which explains emotions by matches or mismatches of the
attitudes of an agent with the state of affairs in the relevant situation.
It is explained how eliciting conditions, i.e. abstract schemata for the
explanation of emotions, can be instantiated with general or abstract
concepts for attitudes and actions, and further enhanced with conditions
and operators for generating reactions, which allow for abductive inference
of explanations of emotional states and determination of reactions.
During this process concepts that are initially abstract are made concrete.
Emotions may work as a self-contained dialog move. They show a
complex relation to explicit communication. Additionally we present our
approach of analyzing indicators of emotions and user state that come
from different sources.



1 Introduction 

For a long period, the concept of rational agents that exchange rational arguments
was the predominant paradigm for research on dialog systems. In the last
decade the scientific community became aware of the fact that emotions, moods
and other attitudes play an important role in natural communication. While
there are considerable advancements in generating affective artificial agents that
display believable emotions in appropriate situations (cf. [1]), the recognition
and interpretation of human emotions in dialog systems is still in its infancy.
The term emotion normally refers to pronounced, clear forms of human states
marked by strong feelings such as, e.g., anger, fear, sadness, joy, etc. - the
so-called “full-blown, big” n (n typically ranging between 4 and some twenty)
emotions. At a close look, however, almost nothing is that clear-cut: the underlying
(bodily and cognitive) processes are not yet fully understood, emotions do
often occur in mixed, not in pure forms, their marking can be overtly suppressed
due to social constraints and rules [2, 3, 4, 5], and there is no full agreement as for
a catalogue of emotions, and of pivotal characteristics, telling emotions apart
from other states such as attitudes, mood, etc.

^ This work was funded by the German Federal Ministry of Education, Science,
Research and Technology (BMBF) in the SmartKom project under Grant 01 IL 905.
The responsibility for the contents of this study lies with the authors.

Research concerned with the generation of an affective and believable beha- 
vior of artificial agents is often based on the so-called OCC model of emotions 
[6] that explains emotion by cognitive processes relating the user’s goals, stan- 
dards, likes and dislikes to the actions of other agents and the state of the world 
that results from these actions. 

Though mixing or suppressing emotions is a problem for the recognition of 
emotions as well as for the fine-tuning of the artificial generation of emotional 
behavior, the OCC model provides a systematic account for relating a certain 
situation to emotional states that fit to this situation. The logical structure of 
the situation that causes a certain emotion is not affected by the question of how 
intensive an emotion is or if it is displayed at all. 

For research which is concerned with the detection of problematic situations
in communication by analyzing the user’s behavior, not only emotions are
relevant. This is the case independently of the question whether the catalogue
of emotions is completely defined or not. For instance, if the user is hesitant, she
may need help, or if she is tired, the system should probably recommend that the
user stop some activity that needs high attention. We use the term “(emotional)
user states” to encompass all non-neutral, somehow marked behavior of the user
within a human-machine-communication. From this point of view, user states
such as bored, stressed, irritated, tired, and so on, can and have to be addressed as
well, irrespective of whether they belong to the one or the other psychological
or physiological category.

In contrast, the psychological or physiological category of a state is relevant 
for its interpretation. The spirit of the approach, namely to consider what type of 
conditions elicit the affective state of the agent, extends to some non-emotional 
states, but not to all. For instance, explaining the state of being hesitant by a 
lack of information may lead to a helpful reaction, while identifying the cause of 
the state of tiredness (in the literal meaning) is - if possible at all - of limited use 
in a dialog system. Rather, the system has to consider the possible consequences 
of the user’s state. 

Some important conditions have to be met, however, if one wants to deal 
with user states in an automatic system: 

— It must be possible to classify the state correctly up to a satisfying extent; 

— thus there has to be a sufficiently large training sample, and 

— the respective user state can be processed within the whole system, not only
at the classification stage.

The first condition means that we should start with user states that are clearly 
marked. This rules out states such as “slightly irritated” - even if they might 
occur quite often and have a strong impact on the felicity of communication. We 
are thus left with those pure emotions like anger or joy which do, alas, not occur
often in real or in Wizard-of-Oz human-machine-communications [4, 3, 5, 7]. 

Overview: The focus of the paper is to present the cognitive-model-based 
approach of abductive interpretation of emotions as it is used in the SmartKom 
system. We do not elaborate on the recognition methods that are utilized in 
SmartKom. We refer to [8] for a presentation of the recognition of emotions and 
user states by prosodic analysis and to [9] for a presentation of the recognition 
from facial expression. 

We start with a brief description of the architecture of emotion analysis in 
SmartKom in the second section. In the third section we introduce the type 
of interaction that we want to realize in the system. In the fourth section we 
present our approach of calculating evidence for certain user states by combining 
indicators from different sources. The remaining sections are dedicated to the 
interpretation of emotions and user states and the generation of reactions to 
these states. 



2 Emotion Processing in the SmartKom System 

SmartKom (www.smartkom.org) is a multi-modal dialog system that provides 
access to multiple applications [10]. In addition to input-modalities that are used 
for intentional communication, the system accounts for the emotional state of 
the user as it is displayed by facial expression or by prosody. The processing of 
emotions and user states consists of three stages: 

— At the first stage the emotional state of the user is recognized from facial 
expression and prosody. 

— At the second stage indications of problematic situations and the emotional 
state of the user are collected from several sources and collectively evalua- 
ted. The component also analyzes the dialog with respect to the style of 
interaction and the task and paradigm knowledge of the user (cf. [11]). 

— The interpretation of emotions and user states, and the generation of reactions
to these states form the third stage. It is realized by the so-called dynamic
help, a component that is dedicated to managing sub-dialogs in order to provide
presentation specification and intention analysis in problematic situations
that are not handled by the standard dialog component of SmartKom (cf.
[12]).

3 The Use Cases 

To demonstrate the added value of user state classification and its subsequent
processing in the SmartKom system, we designed so-called use cases. The first
use case is intended to show how a merely emotional reaction, without explicit 
communication, can work as a self-contained dialog move. In this case, joy or 
anger are interpreted as positive or negative feedback. In the second use case 
emotion works as a semantic operation that turns a positive feedback into a
negative one, which is considered as a form of sarcasm. 

In both use cases, the system suspects that the emotional reaction may be 
caused by a like or a dislike concerning the properties of the presented objects. 
If reasonable candidates of such likes or dislikes can be identified that are not 
already known by the system, it starts a preference update dialog. 

If the system knows positive or negative preferences, it first presents objects
that contain a preferred feature; objects that show a disliked feature will be
shown last.^

user: What’s on on TV tomorrow? 

system: shows talk show at the top of the display, in the middle popular music, 
and crime at the bottom. 

user: And what’s in the evening, in the First Program? 
system: shows a science fiction movie. 

First constellation: emotion-only: 
user: displays joy via facial gestures. 

system: Do you like science fiction? Shall I account for that in future presentations?

Second constellation: emotionally marked verbal communication:
user: That’s really a beautiful program! (She produces this sentence with an
angry prosody. The positive feedback is analyzed as being sarcastic.)
system: You don’t like science fiction? Shall I account for that in future
presentations?

user: Yes./No.

system: OK. I’ll take care of that! 

(Suppose, the user’s answer was yes: In the first constellation science fiction 
will be presented at the beginning of a presentation, in the second constellation 
at the end.) 

user: Please, again tomorrow’s program! 

system: Shows science fiction at the beginning (at the end) of a presentation. 

Instead of answering no or yes the user may also correct the supposed like 
or dislike, e.g. by saying No, I like crime movies or she may just ignore the 
question, by moving to a different topic. In such cases, the system will simply 
not re-arrange the order of presentation. 
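
The ordering rule of the use cases (objects with a preferred feature first, objects with a disliked feature last) can be illustrated with a minimal Python sketch. The names `Item` and `order_by_preferences` are hypothetical and not part of SmartKom. How objects that carry both a liked and a disliked feature are ranked is not specified in the text; the sketch assumes they are treated as neutral.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A presented object, e.g. a TV broadcast, with its features (genre, actor, ...)."""
    title: str
    features: set = field(default_factory=set)


def order_by_preferences(items, likes, dislikes):
    """Return items with liked ones first and disliked ones last.

    Assumption: an item with both a liked and a disliked feature is neutral.
    The sort is stable, so the original order is kept within each group.
    """
    def rank(item):
        has_like = bool(item.features & likes)
        has_dislike = bool(item.features & dislikes)
        if has_like and not has_dislike:
            return 0  # preferred: shown first
        if has_dislike and not has_like:
            return 2  # disliked: shown last
        return 1      # neutral or mixed

    return sorted(items, key=rank)


program = [
    Item("talk show", {"talk show"}),
    Item("music show", {"popular music"}),
    Item("space movie", {"science fiction"}),
]
# After the user confirmed a dislike of science fiction:
reordered = order_by_preferences(program, likes=set(), dislikes={"science fiction"})
```

If the user declines or ignores the preference question, `likes` and `dislikes` stay unchanged and the order of presentation is not re-arranged.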



^ It is possible that an object has liked and disliked attributes, e.g., there may be a 
movie with a preferred genre, in which a disliked actor plays. 
4 Analysis of Indicators of Emotions and Problematic Situations

We introduced in SmartKom a component, the interaction module, that collects 
and evaluates indications of emotions, problematic situations and other aspects 
of the interaction. Indicators can have values between 0 and 1 and these values 
may change in time. 

The interaction module provides a set of models each representing a certain 
user state or a feature of the interaction as output. Each model value is also in 
the range between 0 and 1. Several models support the recognition of emotions 
and try to detect problematic situations during a dialog. The indicator values 
are mapped to the models by means of a matrix multiplication. 
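
The mapping from indicators to models can be sketched as a plain matrix-vector product. This is only an illustration: the indicator names and the weights below are invented, and clipping the result to the range between 0 and 1 is an assumption, since the text does not say how the range of the model values is enforced.

```python
def map_indicators_to_models(weights, indicators):
    """Multiply a weight matrix (one row per model) with the indicator vector.

    Assumption: results are clipped to [0, 1] so that model values stay in range.
    """
    values = []
    for row in weights:
        if len(row) != len(indicators):
            raise ValueError("each weight row needs one entry per indicator")
        score = sum(w * x for w, x in zip(row, indicators))
        values.append(min(1.0, max(0.0, score)))
    return values


# Hypothetical indicators, each with a value between 0 and 1.
INDICATORS = ["facial_anger", "prosody_anger", "anger_words", "low_recognizer_confidence"]

# Hypothetical weights: row 0 feeds an anger model, row 1 an analysis-problem model.
WEIGHTS = [
    [0.4, 0.4, 0.2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

model_values = map_indicators_to_models(WEIGHTS, [1.0, 0.5, 0.0, 0.25])
```

Because indicator values may change over time, the product would be recomputed whenever an indicator is updated.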

Problematic situations and user state information are expressed by three 
models: 

— One model describes the likelihood that the user is angry by combining scores 
from facial expression analysis, emotion extraction from prosody, and use of 
certain words. 

— A second model combines confidence values from recognizers and similar
scores from speech analysis, domain model, discourse history and intention
recognition as well as differences in the distribution of these values among
competing hypotheses; this model is supposed to indicate problems in the
analysis part of the system.

— A third model estimates the dialog progress. Here, the ratio of new informa- 
tion items to total information items (after completion by discourse analysis) 
is employed as one important indicator. 



5 Cognitive-Model-Based Interpretation of Emotions 

Our approach to the analysis of emotions is based on the OCC model of emotions 
developed by Ortony, Clore and Collins. Following the OCC model, emotions are 
characterized by their eliciting conditions. These conditions consist of a certain 
combination of 

— the goals of the agent in this situation 

— her attitudes to certain events (mainly likes and dislikes) 

— the standards that she uses to (morally) judge an event 

— the facts that hold in a certain situation 

— the actions (of other agents) that caused these facts 

For triggering an emotion, it is important how facts are related to the goals
and the likes and dislikes of the user; in particular, it matters whether they
coincide or not. Standards are important for emotions such as anger or gratitude,
which contain criticism or praise of another agent based on her actions. Eliciting
conditions can be viewed as expressing the cause of an emotion by providing a
cognitively comprehensible explanation of it. The following eliciting condition
for anger is taken from [13]:
anger(Agent1, Agent2, State, Sit) if
    holds(did(Agent2, Action), Sit),
    causes(Action, State, SitO),
    wants(Agent1, non_State, Sit),
    blameworthy(Agent1, Action),
    (SitO < Sit)

This condition means that the agent is angry if she believes that another
agent caused some state of affairs that contradicts her goals by performing an
action that is not acceptable according to her standards (expressed by
the blameworthy predicate). The situation variables Sit and SitO express
how the elements of the condition are connected with respect to the sequence
of situations that occur (subsequently we will omit situation variables).
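
To make the reading of this eliciting condition concrete, the following Python sketch checks it against a small set of facts. The tuple encoding of facts and the function name `anger` are illustrative assumptions, not the representation used in [13] or in SmartKom; situations are encoded as integers so that the ordering constraint can be tested.

```python
def anger(facts, agent1, agent2, state, sit):
    """Check the eliciting condition for anger against a set of fact tuples.

    Assumed (hypothetical) encoding:
      ("did", agent, action, situation)
      ("causes", action, state, situation)
      ("wants", agent, ("non", state), situation)
      ("blameworthy", agent, action)
    """
    # Actions that agent2 performed, as they hold in situation `sit`.
    actions = {f[2] for f in facts
               if f[0] == "did" and f[1] == agent2 and f[3] == sit}
    for action in actions:
        # The action caused the state in an earlier situation (SitO < Sit).
        caused_earlier = any(
            f[0] == "causes" and f[1] == action and f[2] == state and f[3] < sit
            for f in facts
        )
        if (caused_earlier
                and ("wants", agent1, ("non", state), sit) in facts
                and ("blameworthy", agent1, action) in facts):
            return True
    return False


example_facts = {
    ("did", "system", "present_scifi_first", 2),
    ("causes", "present_scifi_first", "scifi_on_top", 1),
    ("wants", "user", ("non", "scifi_on_top"), 2),
    ("blameworthy", "user", "present_scifi_first"),
}
user_is_angry = anger(example_facts, "user", "system", "scifi_on_top", 2)
```

Removing any one of the four facts makes the condition fail, which mirrors the conjunctive form of the rule.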

Recognizing the intensity of emotions could provide additional valuable infor- 
mation, e.g., slight anger may occur at the beginning of a problem, while strong 
anger may indicate an enduring problem. But the recognition of the situation
that caused the emotion and the generation of an appropriate reaction are basically
the same whether emotions are displayed slightly or strongly.

5.1 Abductive Interpretation of Eliciting Conditions 

The OCC model is mainly used for the generation of the behavior of an ani- 
mated agent. In this case, one can deliberately define the agent’s likes, dislikes 
and standards in advance. If we want to interpret emotions that are displayed
by an agent, we have to find out which combination of facts, attitudes and
standards may have caused the emotion. Our approach is to achieve this by
analyzing eliciting conditions in an abductive manner. Abduction as a form of
practical inference was introduced by Peirce [14]. Abduction is often characterized
as inference to the best explanation: suppose we observe some fact A, which is
surprising for us. If we know the rule

B, C → A

(i.e. A is true if B and C are true), then we may suspect that also B and C are
true, because this would plausibly explain A. If we know that there is another
rule

D → A

then D is another candidate for explaining A. Hence we need a criterion to decide
which explanation is better. The quality of an explanation depends on two factors:
Do we know all relevant rules (i.e., explanations)? Do we possess criteria
for choosing among explanations? With eliciting conditions we have the advantage
of possessing schemata that claim to characterize all possible explanations of an
emotion.
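
The abductive step described above can be sketched in a few lines of Python. The rule encoding and the function names are hypothetical. The selection criterion used here (prefer the explanation with the fewest premises that are not already known to hold) is only one conceivable choice and is an assumption of this sketch, not the criterion used in SmartKom.

```python
# Rules as (body, head): the head is true if all atoms in the body are true.
RULES = [
    (("B", "C"), "A"),
    (("D",), "A"),
]


def abduce(observation, rules):
    """Return every rule body that would explain the observation."""
    return [set(body) for body, head in rules if head == observation]


def best_explanation(observation, rules, known_facts):
    """Pick the candidate with the fewest premises still to be assumed.

    Assumption: fewer unverified premises means a better explanation.
    """
    candidates = abduce(observation, rules)
    if not candidates:
        return None
    return min(candidates, key=lambda body: len(body - known_facts))


candidates_for_a = abduce("A", RULES)
choice = best_explanation("A", RULES, known_facts={"B", "C"})
```

With `known_facts={"B", "C"}` the first rule needs no further assumption and is preferred; with no known facts, the single-premise rule with body D is chosen.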

5.2 Problems with Abductive Interpretation 

Eliciting conditions are abstract schemata that cannot be used directly to infer 
possible causes of emotions. To perform abductive reasoning on eliciting con- 
ditions, we have to identify concepts that could be filled into the schemata. 
Seemingly, we are in a problematic situation. The system has no information
about the user’s standards, likes and dislikes in advance. It can get information
about her goals from the user’s input. But, on the one hand, this information
may be based on misunderstanding, and, on the other hand, the user may have
goals which cannot be recognized from her utterances. Similar problems occur
with the actions of the system. Actions that are based on misunderstanding are
not relevant for the analysis of the user’s emotion^.



5.3 Abstract Goals and Actions for Emotion Interpretation 

To overcome the problems mentioned in the last paragraph, we introduce meta- 
goals concerning general principles of communication and abstract goals con- 
cerning user needs that (to some extent) depend on the application. For every 
meta-goal or abstract goal we introduce an abstract action that satisfies the goal. 

For instance, to account for misunderstandings, we introduce understanding 
as an action on the meta-level and to be understood as a goal on the meta-level. 
To account for user preferences, we introduce the concept that a presentation
accounts for the user’s preferences as an abstract action of the system (let it
be called presentByPreferences) and accordingly the possible abstract fact or
user goal isPresentedByPreferences^. This goal is abstract and under-specified
because we do not know the concrete preferences of the user. Further, the relevant
types of preferences depend on the type of the application.

Reasonable goals (facts, actions, likes, standards) have to be identified by
careful analysis of general principles of communication and the needs of the user
with respect to the type of applications she is working with. This needs empirical
validation, which could not be provided within the scope of the SmartKom project.
Which set of concepts is chosen also depends on practical decisions: which
goals will the system support at all, will the system possibly recognize goals
that it is not able to handle, will the system react to any recognized emotion in
some way (e.g. by expressing regret as a default in case of anger), or will it only react
to emotions to which it can provide a repair or other meaningful cooperative
reaction? We demonstrate the approach by the example of anger.



General Concepts. We first look for actions or facts that may contradict the 
user’s wishes, likes, dislikes, or standards on a general level. Important candidates 
for abstract actions that contradict the user’s wishes are misunderstanding, 
slow processing and requests with a negative or disliked outcome. Ac- 
cordingly, we stipulate abstract or general goals, e.g., the goal to be understood 
properly. 

® Although the type of the action that the system wrongly performs, may influence 
the intensity of the user’s negative feelings 

^ For convenience we often identify the name of the fact and the name of the goal to 
make this fact true. 
Application Dependent Concepts: Problematic Results of Database
Queries. According to our use cases we concentrate on requests with liked or 
disliked outcome as a source of negative or positive emotions. We identified four 
types of disliked results: 

— the result is empty, 

— the majority of retrieved objects show features that are not liked by the user, 

— the objects are presented in a way that is contrary to the preferences of the
user, e.g., by presenting disliked objects first, and

— the user query returned a result set that is too large. The user may need
help with further specification possibilities.

We assume for the list of topics above, that the disliked or problematic results 
are not due to misunderstanding. Misunderstanding is taken as evoking its own 
class of constellations. If misunderstanding is involved, the result is not relevant 
for the analysis. 



User specified goals and system initiated actions. As long as no misunderstanding
is involved, the SmartKom system will usually simply follow the
user’s specification. If this works, no anger should arise with respect to the fact
that the system tries to achieve this goal (but perhaps instead joy). In specific
situations the system may initiate actions that are necessary from the point of
view of the system, but may be disliked or even considered blameworthy by the
user. For instance, the system may require a biometric verification, which the
user dislikes. Such actions are relevant for explaining negative emotions, but are
not considered in our implementation.

As an anonymous reviewer pointed out, inappropriate or undesired reactions
to emotions could also be a cause for anger (or for being bored). In fact,
this is a subcase of disliked system initiated actions.

6 Analyzing and Handling of Pure Emotion

With the concepts introduced in the last section, we are able to build instantiations
of eliciting conditions that allow us to infer combinations of goals, facts, actions,
likes and dislikes that possibly explain the user’s emotion. We call instantiations
of eliciting condition schemata eliciting constellations. To get criteria for
selecting the relevant constellation, we augment constellations with conditions
and organize these conditions internally as a decision tree^.

Further, the system has to determine reactions that are appropriate for 

— resolving the situation that caused the negative emotion, 

— avoiding negative emotions in future in similar situations, and 

— promoting the occurrence of positive emotions in similar situations. 

® As mentioned in the conclusion we could perform testing only in limited way. Thus 
no training of the decision tree was possible 
It is also desirable to include methods that provide abstract under-specified
goals and actions with presumable values. Such values are not only used for 
determining concrete system reactions, they serve as a part of the constellation 
conditions. 

According to our use cases, we have to consider database queries that retrieve 
disliked objects. The system offers as repair that it will regard the likes and 
dislikes of the user in its presentations. 

A constellation for handling anger according to our use cases is given below
(leaving out some minor details) in a Prolog style notation. It applies to browsing
the television program or the cinema program. For these applications preferences are
actually taken into account for the presentation. Such a rule is basically processed
in the following manner: First the conditions are tested (internally the conditions
are processed in a decision tree like order). Then the cause of the emotion,
which is represented by the clauses above the conditions, is considered as a
reasonable explanation, whereby the variables are filled by the result of the
condition processing. Then the system action is performed.

anger(thisConstellation, user, system) if
    holds(did(system, non_presentByPreferences(dislike(user, X)))),
    causes(non_presentByPreferences(dislike(user, X)),
           non_isPresentedByPreferences(dislike(user, X))),
    wants(user, isPresentedByPreferences(dislike(user, X))),
    blameworthy(user, non_presentByPreferences(dislike(user, X))),
    conditions(thisConstellation, X),
    (proposed system action:) update(dislike(user, X)).

The constellation expresses that there is a concrete reading of the goal
presentByPreferences that may be a goal of the user, that this goal is not
satisfied, and that ignoring the goal is against the standards of the user.
The constellation contains facts and actions that are
not concretely specified. For instance, we do not know whether the presentation
contains some possibly disliked feature, and we do not know which feature it is.

We test the salience of the constellation by establishing the following conditi- 
ons. The predicate presentationEntriesContainCommonFeature(X) also delivers 
a concrete presumable instance of the user’s dislike. 



conditions(thisConstellation, X) if
    presentationEntriesContainCommonFeature(X),    (1)
    non_specified(user, X),                        (2)
    non_knows(system, like(user, X)),              (3)
    non_knows(system, dislike(user, X)).           (4)



(1) verifies whether the user perceives too many objects with the supposed disliked
feature. (It also excludes the case that there is no result at all, which would
support a different explanation for anger.) It is important for the other tests
that the predicate delivers a hypothesis for the disliked feature. (2) excludes that
the user is angry about the occurrence of features that she has specified in her
request (there is a possibility of misunderstanding). (3) excludes that the user
is angry about a feature that she has already declared to like. (4) excludes that
the system in fact tried to present the disliked feature appropriately but just did
not find other objects.
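
The four conditions can be read as a sequence of filters on a hypothesis for the disliked feature. The following Python sketch illustrates this reading. All function names are hypothetical, and the threshold `min_share` for "too many objects" is an assumption, since the text gives no concrete value.

```python
from collections import Counter


def common_feature(presented, min_share=0.5):
    """Condition (1): return the feature shared by more than min_share of the
    presented entries, or None (also None for an empty result).

    Assumption: min_share=0.5 stands in for "too many objects".
    """
    if not presented:
        return None
    counts = Counter(f for entry in presented for f in set(entry))
    if not counts:
        return None
    feature, count = counts.most_common(1)[0]
    return feature if count / len(presented) > min_share else None


def dislike_hypothesis(presented, requested, known_likes, known_dislikes):
    """Return a presumable disliked feature X, or None if a condition fails."""
    x = common_feature(presented)      # (1) delivers the hypothesis
    if x is None:
        return None
    if x in requested:                 # (2) user asked for this feature herself
        return None
    if x in known_likes:               # (3) user already declared to like it
        return None
    if x in known_dislikes:            # (4) dislike was already known to the system
        return None
    return x


presentation = [{"science fiction"}, {"science fiction", "movie"}, {"talk show"}]
hypothesis = dislike_hypothesis(presentation, requested=set(),
                                known_likes=set(), known_dislikes=set())
```

If a hypothesis is returned, it would fill the variable X of the constellation and be passed to the update sub-dialog.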

emotion. For emotions displayed by facial expressions, we check whether the emotion
emerges in a certain time interval after the presentation was displayed. With
prosodically displayed emotion we check whether the verbally expressed content is
compatible with the explanation of the emotion. It turned out that it is not
sufficient to test whether there are already stored preferences. It should additionally
be checked whether a user has declined to store a preference. This has to be
remembered, otherwise the system may propose the same preference repeatedly.

The action update(dislike(user,X)), which is attached to the constellation,
initiates a sub-dialog that verifies whether the user has the supposed dislike. It is not
only a repair action, but takes part in the explanation process.

The conditions mentioned so far are not sufficient to discriminate between competing
explanations. Such competing explanations have to be modeled, even if
no reaction is foreseen for these cases. We distinguished three main sources of anger:
misunderstanding, slow processing, and requests with a negative or disliked
outcome. Evidence for problems in the analysis part is detected by the interaction
module (cf. section 4). Slow processing is a possible explanation for anger
if anger occurs during the analysis. The absolute duration of processing is also a
criterion. These data are accessible via a module (the so-called watchdog) that
monitors the processing state of the system.

7 Emotions and Communicative Acts 

Emotions that are signaled by facial expressions do not need to be accompanied 
by additional communication at all. Emotions expressed by voice are naturally 
related to some acoustic output. In the extreme, this output is only a container 
for the expressed emotion, but usually it contains a certain semantic content. 
The analysis of the relation between semantic content and underlying emotions is
in its infancy, compared, e.g., with the relation between verbally communicated
semantic content and pointing gestures. The latter is sufficiently well understood to build
practical applications. In the following we distinguish between communicative
acts with semantic content, which are provided by speech and gestures, on the
one hand, and emotions on the other hand.

The interpretation of pointing gestures and verbal utterances can be concei- 
ved as a fusion process, which unifies pieces of information. Semantic contradic- 
tions between pointing gestures and verbally provided information are indicati- 
ons for errors. The relation between emotions and communicative acts is much 
more complicated. We give a presumably non-exhaustive classification of types 
of interaction between displayed emotion and communicated semantic content. 



Redundancy. Semantic content redundantly expresses a simultaneously displayed
emotion, as in that makes me angry or I’m glad about that, or semantic
content expresses an attitude that corresponds to the direction of the emotion
(whether it is positive or negative), as in great or bad.



Contribution to the Explanation of the Emotion. Semantic content expresses
a concrete attitude (like or dislike) that is involved in triggering the
emotion, as in I don’t like thrillers or great movies, or semantic content addresses
the facts and actions that caused the emotion, as in you didn’t understand me
or that takes too much time, or simply by uttering thrillers accompanied by a
positive or negative emotion.

The thriller example contributes the concrete feature that may fill the abstract
goal of being presented according to preferences. But this example does not
necessarily express a like or dislike, as great movies does. With a negative emotion,
the example may also belong to the topic Semantic Content as Repair Action.



Semantic Content as Repair Action. The semantic information is provided
to repair the state of affairs that has caused the emotional state of the user.
The thriller example also works here: thriller could be a correction of a
misunderstanding of the genre. There is no direct relation between the content of the
utterance and the displayed emotion.

This is very common and important in human-machine dialog as well as in
human-human dialog: the dialog partner repeats or reformulates her request and
concurrently displays a negative emotion. With overt anger, it could also be
expected that the user cancels the interaction as a final form of repair.



Change of Semantic Content. The user displays a negative emotion and
verbally communicates a positive attitude, e.g., marvelous or great movies. The direction
of the valenced attitude that is communicated verbally is changed by the
direction of the displayed emotion. This is a simple form of sarcasm.
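
The interplay of verbal valence and displayed emotion in the redundancy and sarcasm cases can be summarized in a small Python sketch. The function name, the labels and the encoding of valence as +1 or -1 are illustrative assumptions. The combination of a negative verbal attitude with a positive displayed emotion is not discussed in the text and is therefore left unclassified here.

```python
def interpret_feedback(verbal, emotion):
    """Combine verbally communicated valence with displayed emotion valence.

    Each argument is +1 (positive), -1 (negative) or None (not present).
    Returns a (label, resulting_valence) pair.
    """
    if emotion is None:
        return ("verbal only", verbal)
    if verbal is None:
        # Emotion-only move, e.g. joy or anger shown by facial expression.
        return ("emotion only", emotion)
    if verbal == emotion:
        # Content and emotion point in the same direction.
        return ("redundancy", emotion)
    if verbal > 0 and emotion < 0:
        # Positive wording with negative emotion: the emotion flips the valence.
        return ("sarcasm", emotion)
    # Negative wording with positive emotion is not covered by the text.
    return ("unclassified", None)


# "That's really a beautiful program!" spoken with angry prosody:
label, valence = interpret_feedback(verbal=+1, emotion=-1)
```

The repair and explanation cases described above need the semantic content itself, not only its valence, and are outside the scope of this sketch.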

8 Results and Conclusion 

A complete implementation of the whole processing chain was available at the
end of the project. There was no opportunity for systematic tests, which require
high effort. For instance, the recognition of facial expression needs careful preparation
of the environment with respect to lighting conditions to work. Our limited
testing shows that, provided recognition is correct, the emotion interpretation
generates the reactions that are requested by the use case specification.

We successfully implemented a cognitive-model-based approach for analyzing
emotions and other affective states of a user who participates in a multi-modal
human-machine dialog. This is a success, but it will still take considerable effort
to make it practically useful. The approach is based on an elaborated theory,
which covers a broad range of phenomena. This is promising with respect to the
extensibility of the approach. It is an important advantage of the approach that
it generates conceivable explanations of emotions that allow for well directed
system reactions. The approach is not restricted to handling classical emotions,
but extends to other affective states. Nor is it restricted to states that are
displayed non-verbally. Affective verbal feedback, such as I like this, can be explained
along similar lines.
Abstract. In affective dialog simulation, recognition and interpretation of the
affective state of the user should be integrated with display of empathy by the 
system and with dialog planning and execution. Cognitive models dealing with 
the inherent uncertainty of this interpretation are the method we propose to 
adopt. We describe how we integrated these models in an information-state ap- 
proach to dialog modeling by illustrating, in particular, their application to a 
decision support system which is tailored to the ‘state of change’ of the user. 



1 Introduction 

In computer-based advice-giving dialogs, the system aims at convincing the user to 
follow a given plan, by providing the information which is needed to increase the 
chance of success of its persuasive action. This chance of success increases if the 
system establishes an empathic relationship with the user: even in domains which 
might appear neutral, empathy contributes to making advice more acceptable. Recognizing
the user's attitude and adapting the advice accordingly is even more crucial when the advice
concerns a course of action that, for some reason, the user may find difficult to follow:
typically, ceasing to smoke or adopting healthy eating habits. In this case, the amount and
type of information provided has to be calibrated to the attitude of the user towards
behaviour change: this attitude includes her beliefs about what a 'correct' behaviour is,
her intention to change her behaviour if it is incorrect, and her knowledge of a plan for how this
goal may be achieved. Prochaska and colleagues [15] proposed a theoretical model of
behaviour change and a method to assess the state of a subject in this change and to 
find the 'best' advice-giving procedure in every stage. Their model became a reference 
for a large number of public health services all around the world. 

Adoption of an advice-giving policy which is tailored to the attitude of the user re- 
quires, first of all, recognizing this attitude and, secondly, selecting the strategy to 
adopt from a library of known and experimentally evaluated alternatives. As this proc- 
ess occurs in a situation of uncertainty, both tasks have to be seen as dynamic proc- 
esses, in which the system proceeds by a trial-and-error method. Dialogs are much 
more suited to this task than monologs, as they enable the system to monitor the situa- 
tion, revise its hypothesis and re-adapt dynamically its strategy if needed. 

A dialog model of this kind needs a sophisticated component for interpreting the
user moves. Not only beliefs, intentions and plans have to be inferred from analysis of
these moves, but also the (stable or transient) 'affective' components of the user's mind:

E. Andre et al. (Eds.): ADS 2004, LNAI 3068, pp. 77-88, 2004. 
her personality, mood and emotional state. On the other hand, selecting the appropriate
advice-giving procedure alone is not a guarantee of success; the system has to
show, at the same time, that it is not a 'purely rational' agent but shares part of the 
emotions of the user, in order to establish a relationship of trust and therefore to in- 
crease its chance of success. The system knows, finally, that what it will say will influ- 
ence the user's rational and emotional attitude, that is her beliefs, intentions and plans; 
it may exploit this knowledge to achieve its goal of persuading the user to follow the 
appropriate behaviour. This exploitation is not unfair, because it is done 'for the in- 
terlocutor's sake'. 

An advice-giving dialog system of this kind needs a sophisticated model of both the 
system and the user, which extends the well known BDI approach with a non-rational 
component (BDI&E). It needs a dialog simulation method which adapts dynamically 
goals and plans to these two models and to a forecast of the possible effects of a can- 
didate move on the user's affective and rational state. In this paper, we describe how 
we simulated such a dialog system by extending the 'information-state' model [17]; an 
example in the healthy eating domain will illustrate the method. 



2 Two Example Dialogs 

Let us start from the following excerpts of dialog: 

Dialog 1: 

S3: I may try to help you, but cannot take a decision in your place.

U3: I would like to do something but it's so difficult!

S4: I know what you mean: diet is difficult to follow, particularly if you eat out of home because
of work. But the effects may be very positive!

U4: Are you sure?

S5: Of course! Being in good shape can have beneficial effects on your appearance and health!
I'm sure you know this, do you?

Dialog 2:

as in Dialog 1

U4: I know! I definitely intend to change my eating habits.

S5: Good! Why, for instance, don't you try to avoid daily lunches in a bar or high-fat sandwiches?

U5: Every day, you mean? I may try, but I must find an alternative.

S6: You may bring something from home: some fresh fruits or vegetables, for instance. And
have a cup of tea or a fruit juice.

In these examples, the system interleaves suggestions with sentences aimed at assessing
the user's reaction and inferring her attitude. At move U4 in Dialog 1, the user seems
to show a ‘doubtful’ attitude about her problem behaviour and the system (in S5) tries
to encourage evaluation of the advantages of changing it; in Dialog 2, her answer
is positive and suggests that she probably considers her behaviour incorrect:
the system then provides an initial suggestion of what to do.
3 State of Change Model



This model [15] describes how a subject moves from a 'wrong' behaviour
in some domain to adoption of a 'correct' one. Six main stages are identified in this
process of change (see Table 1).



Table 1. Signs and Actions in the SoC Model [15]

State of Change: Pre-contemplation
Signs:
— Does not want to change behaviour, now or later
— Believes that behaviour is acceptable
Actions:
— Validate lack of readiness
— Clarify: decision is theirs
— Encourage re-evaluation of current behaviour
— Encourage self-exploration
— Explain and personalize the risk

State of Change: Contemplation
Signs:
— Doubts that behaviour is acceptable
— Thinks about why one follows the bad habit, what its payoff is
— Considers seriously the opportunity of changing behaviour
— Does not want to commit to change it soon
Actions:
— Validate lack of readiness
— Clarify: decision is theirs
— Encourage evaluation of pros and cons of behaviour change
(bring both the rational mind and the emotions into play to
move you to a commitment to change)
— Identify and promote new, positive outcome expectations

State of Change: Preparation
Signs:
— Believes that behaviour should be changed
— Intends to change behaviour soon
Actions:
— Assist in problem solving
— Provide social support
— Verify that the patient has underlying skills for behaviour change
— Provide counselling
— Encourage initial steps: remove temptations, plan how
action will be taken, arrange support and understanding from
your neighbourhood
— Suggest a follow-up plan

State of Change: Action
Signs:
— Is already following a plan to change behaviour (for some months)
Actions:
— Follow closely in order to sustain change of behaviour
— Bolster self-efficacy for dealing with obstacles
— Combat feelings of loss and reiterate long-term benefits

State of Change: Maintenance
Signs:
— Has been maintaining change for more than 6 months
Actions:
— Plan for follow-up support
— Reinforce internal rewards
— Discuss coping with relapse: learning the difference between
lapse and a total relapse; be prepared to recognize a
lapse and take immediate action

State of Change: Relapse
Signs:
— Went back to incorrect behaviour
Actions:
— Reassess motivation and barriers
— Assess state of change
— Plan stronger coping strategies



The model suggests how these states may be recognized from a set of signs and which 
actions may be adopted to promote a correct behaviour in each of them. Signs display 
the following aspects of the subject’s mental state: 

• value given to the ‘correct’ behaviour which is the object of the persuasion 
process; 

• knowledge of reasons influencing the adoption of a problem behavior; 
• belief that (internal and external) conditions exist to change this behaviour; 

• intention to change own behaviour if wrong; 

• knowledge of an acceptable plan which enables achieving this intention; 

• level of perseverance in following the plan. 

Actions that may be applied at every state of change respond to the following goals: 

• Recognize the situation 

• Inform and encourage evaluation processes rather than enforcing persuasion 

• Influence intentions 

• Check abilities 

• Suggest plans 

• Support during plan execution. 

This model may therefore be adopted as a powerful knowledge source to build cogni- 
tive models of the users’ attitude and to decide how to tailor advice-giving to their 
state. 

As suggested in [19], state of change of behaviour and emotional state are strongly 
interrelated. For instance, in the precontemplation state subjects may be demoralized 
about their ability to change, if they tried to do it and failed; in the maintenance state, 
people are increasingly more confident that they can continue their change; negative 
affect or emotional distress are indicators of the risk of relapse, and so on. Recogniz- 
ing some aspects of the emotional state (valence and arousal) may therefore contribute 
to inferring the state of change and, at the same time, may drive the choice of an ap- 
propriate persuasion strategy. 

Accurate measurement of the state of change can be performed by asking individuals 
to respond accurately to a series of unambiguous questions. To this aim, a question- 
naire may be administered to the subject at the dialog start. Alternatively, the state of 
change (and the emotional state) may be inferred progressively during the dialog; 
uncertainty in knowledge of this data has then to be accepted as a critical feature of 
dialog simulation, and the dialog modelling method must be adapted to this situation. 
This is the approach we take in this paper. 



4 Cognitive and Affective Models 

Let us adopt the following notations: 

Ai, Ah denote the two interlocutors of the dialog; U denotes, in particular, the user and S the system; x, y, z denote domain facts; g, g' denote desired states of the world; e denotes an emotion. The following formulae stand for the respective sentences: 

(Say Ai x) for "agent Ai says x"; (Bel Ai x) for "Ai believes that x"; (Goal Ai g) for "Ai wants that g"; (Bel Ai (Ev-Ach Ai g)) for "Ai believes that its goal g will be achieved sometime in the future"; (Feels Ai e) for "Ai feels the emotion e". 

What an agent Ai says is a function of its own state of mind and of its goals and beliefs about the interlocutor's state of mind; this communicative act influences Ah’s state of mind. According to this model, an agent move (Say Ai x) activates two sorts of processing activities, depending on who Ai is: 
Manipulation: Ai is the system and Ah is the user. The model allows us to forecast the possible effects of a system move on the emotional state of the user and therefore to select the most promising strategy to apply for a given user and in a given situation. We represent this model with a dynamic belief network, as described in more detail elsewhere [3]. 

Empathy: we employ a similar model to simulate activation of emotions in the agent, 
after a user move (see figure 1). 







Fig. 1. The emotion activation model 

Interpretation: Ai is the user and Ah is the system. The model allows us to infer the user’s state of mind from analysis of her move. We implemented this model with a static belief network whose ‘hidden variables’ are the state of change and the emotional state and whose ‘observables’ are the user moves (figure 2). 




Fig. 2. The User move interpretation model 
As we said in Section 3, the 'state of mind' should be understood, in this case, as a combination of state of change and emotional situation (defined in terms of valence and arousal, as in the circumplex model of affect [16]). The intermediate nodes of the network are the aspects of the user’s mental state that we described in Section 3. The user may display these cognitive features through one or more communicative acts (leaf nodes). For instance, if the user says "I definitely intend to change my eating habits" or "I want to make every possible effort to get out of this", these moves will be interpreted as a strong intention to change. If she says "I tried several times but I failed", the agent may infer that the user has the intention to change and formulated a plan to achieve this goal, but did not persevere in applying it. Therefore, reasoning is performed in this BN by introducing evidence in one or more leaf nodes and checking how the probability distributions of the "state of change", "valence" and "arousal" variables change. 
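
The reasoning step just described (enter evidence at a leaf node, read the revised distribution of a hidden variable) can be illustrated with a very small sketch. It is not our network: the three-node chain, the variable names and all probabilities below are invented for illustration, and exact enumeration replaces the propagation performed by a belief network engine.

```python
# Hypothetical chain: StateOfChange -> Intention -> UserMove.
# All numbers are invented for illustration; they are NOT the paper's parameters.

STATES = ["precontemplation", "contemplation", "preparation"]
PRIOR = {"precontemplation": 0.3, "contemplation": 0.4, "preparation": 0.3}

# P(intention level | state of change): the intermediate node.
P_INTENTION = {
    "precontemplation": {"weak": 0.9, "strong": 0.1},
    "contemplation": {"weak": 0.6, "strong": 0.4},
    "preparation": {"weak": 0.1, "strong": 0.9},
}

# P(user move | intention level): the leaf node.
P_MOVE = {
    "weak": {"intends_to_change": 0.1, "doubts_pros": 0.9},
    "strong": {"intends_to_change": 0.8, "doubts_pros": 0.2},
}


def posterior_state(observed_move):
    """Return P(state of change | observed user move) by exact enumeration."""
    unnormalised = {}
    for state in STATES:
        # Sum out the intermediate 'intention' node.
        likelihood = sum(
            P_INTENTION[state][level] * P_MOVE[level][observed_move]
            for level in P_INTENTION[state]
        )
        unnormalised[state] = PRIOR[state] * likelihood
    total = sum(unnormalised.values())
    return {state: value / total for state, value in unnormalised.items()}


if __name__ == "__main__":
    for state, p in posterior_state("intends_to_change").items():
        print(f"{state}: {p:.3f}")
```

With these invented parameters, observing the move 'intends_to_change' shifts most of the probability mass to 'preparation', which mirrors the qualitative behaviour described above.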

Application of the three models of manipulation, empathy and interpretation produces a dynamic revision of the mental state of the agent and the user: their beliefs, goals and affective state are updated as the dialog proceeds and influence activation of the plans the agent performs to achieve them. We will describe how this is done after briefly recalling the principles behind the information state approach to dialog simulation. 



5 The Information State (IS) Model 

This model was developed within the TRINDI EC Project [18] as a method to formalize modular dialog simulation systems with a plan-based approach. The information state (IS) is the blackboard in which all data needed to develop the dialog are stored; these data are revised dynamically by means of IS update rules; tasks to perform and plans to achieve them are represented with a logical formalism; select rules establish the next plan and move to perform; a control strategy drives activation of rules; some external modules integrate the functions performed by the dialog manager [17]. 

In simulating our affective dialogs with an IS model, we developed the following components: IS structure, IS updating, Goals and plans, and Interaction Manager. 



5.1 IS Structure 

The IS is divided, as usual, into a 'shared' and a 'private' part. We extended its structure with new data, which enable us to represent a BDI&E model of both the agent and the user (Table 2). The AGENT Model is a subrecord of the SHARED record and includes several fields: 

• four stacks (in GOAL), which represent its goals, labelled as 'short-term' or 
'long-term', 'achieved' and 'current'; 

• a set of propositions (in HISTORY) which represent the completed plans; 
• a set of propositions (in IM, 'integrated move'), which represent the moves performed; 

• a pair of strings (in EMOTION), which represent the last emotion felt. These data are updated after every user move (Section 4). 

Table 2. IS structure 

SHARED 
  BEL: Set(Proposition) 
  QUD: Stackset(Question) 
  LM: Lm 
  AGENT 
    GOAL: (Short-term, Long-term, Achieved, Current) 
    HISTORY: Set(Proposition) 
    IM: Set(Action) 
    EMOTION: Emotion 
  USER 
    STABLE: Set(Proposition) 
    UNSTABLE: (StateOfChange, Valence, Arousal) 

PRIVATE 
  AGENDA: Stackset(Action) 
  PLAN: Stackset(Action) 
  BEL: Set(Proposition) 
  TMP: Shared 



The dialog history is stored in this part of the IS; this information drives applicability 
of IS update rules. The USER Model is also a subrecord of the SHARED record and 
includes two sets of propositions: 

• permanent characteristics (in the 'STABLE' part) which do not change in the 
course of the dialog: for instance, ‘name’, ‘age’ and ‘personality’; 

• transitory characteristics (in the 'UNSTABLE' part) which are revised during 
the dialog: for instance, the user's emotional state and her state of change. 
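
As a rough illustration of the AGENT and USER subrecords described above, the following sketch encodes them as Python data classes. The class and field names, the Python types, and the reading of the EMOTION pair as (name, intensity) are assumptions made for this example; the actual IS is a TRINDI record, not Python code.

```python
from dataclasses import dataclass, field


@dataclass
class AgentModel:
    """AGENT subrecord of SHARED (field names follow Table 2)."""
    # Four goal stacks; Python lists are used as stacks here (an assumption).
    goal: dict = field(default_factory=lambda: {
        "short_term": [], "long_term": [], "achieved": [], "current": []
    })
    history: set = field(default_factory=set)   # completed plans
    im: set = field(default_factory=set)        # moves performed
    # Last emotion felt; (name, intensity) is a hypothetical reading of the pair.
    emotion: tuple = ("", "")


@dataclass
class UserModel:
    """USER subrecord of SHARED."""
    # Permanent characteristics, e.g. name, age, personality.
    stable: set = field(default_factory=set)
    # Transitory characteristics, revised during the dialog.
    unstable: dict = field(default_factory=lambda: {
        "state_of_change": None, "valence": None, "arousal": None
    })


@dataclass
class SharedRecord:
    """Only the two subrecords discussed in the text are modelled here."""
    agent: AgentModel = field(default_factory=AgentModel)
    user: UserModel = field(default_factory=UserModel)


if __name__ == "__main__":
    shared = SharedRecord()
    shared.user.stable.add(("name", "Anna"))            # invented example value
    shared.user.unstable["state_of_change"] = "contemplation"
    shared.agent.goal["current"].append("recognize the situation")
    print(shared)
```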



5.2 IS Updating 

To select the goals to achieve and the best plans to achieve them, our system needs to 
employ any information it has got so far about the user. Uncertain default information 
may be employed to guide the first steps of advice-giving, provided this blurred pic- 
ture of the user is refined in the course of the dialog, so as to refine also the advice 
provided. Our simulator therefore needs on one side an uncertain knowledge updating 
system and, on the other side, a description of the current situation of the two inter- 
locutors and of the dialog history, on which to ground its planning activity. 

The logical representation and inference mechanism of TRINDI is unsuited to repre- 
sent uncertainty in updating the user model and in interpreting her moves. To perform 
this function, we need some external components which process the static and dy- 
namic belief networks described in Section 4: we implemented these functions by 
means of Hugin's APIs [11]. 

At the same time, a set of updating rules enables us to import in the IS the results of 
propagating new evidence in these networks. We defined, in particular, ‘empathic’ 
rules to update the emotional state of the agent after a user move and ‘user updating’ 
rules to revise the ‘unstable’ part of the user model after interpreting her move. 
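
A 'user updating' rule of this kind might look roughly as follows. This is a sketch under stated assumptions: the function name, the dictionary layout of the 'unstable' part and the example distributions are hypothetical, and the posteriors are assumed to have already been computed by the external belief network component.

```python
def user_updating_rule(unstable, posteriors):
    """Return a revised copy of the 'unstable' part of the user model.

    `posteriors` maps each inferred variable to its probability distribution
    after evidence propagation. For each variable, the most probable value is
    stored together with its probability, so that later rules can compare the
    probability with a threshold.
    """
    updated = dict(unstable)
    for variable, distribution in posteriors.items():
        value, probability = max(distribution.items(), key=lambda item: item[1])
        updated[variable] = {"value": value, "p": probability}
    return updated


if __name__ == "__main__":
    unstable = {"state_of_change": None, "valence": None, "arousal": None}
    # Invented distributions, standing in for the output of the network.
    posteriors = {
        "state_of_change": {"contemplation": 0.35, "preparation": 0.45, "action": 0.20},
        "valence": {"positive": 0.7, "negative": 0.3},
    }
    print(user_updating_rule(unstable, posteriors))
```

Variables for which no new posterior is available (here, arousal) keep their previous value.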







A. Cavalluzzi, V. Carofiglio, and F. de Rosis 



5.3 Goals and Plans 

The goals described in Section 3 are not isolated entities: they are related by priority 
links which establish an order to follow in their consideration according to the fol- 
lowing strategies: 

• ‘recognize the situation’ is done until preliminary data have been acquired 
and the user’s state of change may be inferred with a probability above a 
given threshold; 

• ‘inform and encourage evaluation’ is done in all cases in which the subject 
does not seem to be really convinced that her behaviour is not correct; 

• ‘plan suggestion’ and ‘ability checking’ are conditioned to verifying that 
the subject has got the minimum required level of intention to change her 
behaviour; 

• ‘support to plan execution’ is conditioned to verifying that the subject did 
formulate a plan and is following at least its initial steps. 
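
The priority ordering among goals can be read as a cascade of tests on the user model. The sketch below is one possible reading, not our rule set: the function name, the numeric thresholds, the boolean flags and the fallback to 'influence intentions' when intention is too low are all assumptions introduced for illustration.

```python
def select_goal(p_state, convinced, intention, has_plan,
                state_threshold=0.5, min_intention=0.5):
    """Pick the next goal following the priority links listed above.

    p_state    -- probability of the most likely state of change
    convinced  -- True if the user seems convinced her behaviour is incorrect
    intention  -- estimated level of intention to change, in [0, 1]
    has_plan   -- True if the user formulated a plan and follows its first steps
    Thresholds are hypothetical values, not parameters of the system.
    """
    if p_state < state_threshold:
        return "recognize the situation"
    if not convinced:
        return "inform and encourage evaluation"
    if intention < min_intention:
        # Assumed fallback: work on intentions before suggesting plans.
        return "influence intentions"
    if not has_plan:
        return "plan suggestion and ability checking"
    return "support to plan execution"


if __name__ == "__main__":
    print(select_goal(p_state=0.3, convinced=False, intention=0.2, has_plan=False))
    print(select_goal(p_state=0.8, convinced=True, intention=0.9, has_plan=False))
```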

Every goal may be achieved by means of one or more plans, each characterized by a 
set of applicability conditions on the user state of change and the emotional state of 
the two interlocutors. A plan Pi includes a main part (the essential of the plan) and a 
secondary part (with optional details). It may be linked to another plan Pj by a causal 
link, to represent that the execution of at least its main part should precede execution 
of Pj. Introduction of precedence relationships and distinction between main and sec- 
ondary parts are our solution to the problem of uncertainty in the knowledge of the 
user state. The system gives the user the opportunity to react to its moves, by frag- 
menting its plans into short moves and by interleaving suggestions with questions. The 
user may answer these questions or input new comments; both kinds of moves enable 
the system to reason on her state and on the emotional effect produced by its move. 
On the other hand, as interpretation of user moves is uncertain, dialogue plans are selected by default but may be revised as the information acquired becomes clearer. When the user model is revised, an ongoing plan may be interrupted and replaced with a new one. In this case, priority and causal relationships ensure that a 'believable' dialog flow is kept even after plans are changed. The distinction between main and secondary parts ensures that a goal is achieved with plans at different levels of detail, depending on the situation and the dialog history. 
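
The plan structure described in this section (a main part, an optional secondary part, and causal links to plans whose main part must be executed first) can be sketched as follows. The class, its field names and the example moves are hypothetical; they only illustrate the data structure, not our plan library.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    main: list                                       # essential moves
    secondary: list = field(default_factory=list)    # optional details
    # Names of plans whose main part must have been executed beforehand.
    after: list = field(default_factory=list)


def is_applicable(plan, completed_main):
    """A plan may start only when the main parts of its predecessors are done."""
    return all(name in completed_main for name in plan.after)


def moves_to_perform(plan, include_details):
    """Main part always; secondary part only when the situation allows detail."""
    return plan.main + (plan.secondary if include_details else [])


if __name__ == "__main__":
    clarify = Plan("clarify", main=["say: the decision is yours"])
    encourage = Plan(
        "encourage_evaluation",
        main=["say: the effects may be very positive"],
        secondary=["ask: are you aware of the benefits?"],
        after=["clarify"],
    )
    print(is_applicable(encourage, completed_main=set()))        # False
    print(is_applicable(encourage, completed_main={"clarify"}))  # True
    print(moves_to_perform(encourage, include_details=False))
```

When the user model is revised and a plan is interrupted, the same applicability check can be run on the replacement plan, so that precedence is still respected.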



5.4 Interaction Manager 

Effective advice-giving is favoured by a friendly interaction style. In the present prototype, the list of possible user moves in every dialog phase is displayed in an iconic menu, in which the user selects the control which best represents what she would like to say. As an ongoing next step of our research, input will be in free text, with a linguistic analysis to recognize emotional and state components; this evidence will be propagated in the belief network. 

The agent move may be displayed in textual form or may be spoken by an embodied agent. The agent's body may be selected from a set of options as the most suited to the specific interaction context: GRETA [14], MS_Agents [12] or Haptek [10]. To this aim, the agent move is transferred to an external body-animation module as an APML string [5] which activates a ‘wrapper’ to the player [7]. A graphical interface schedules the various functions, activates the related modules and manages input/output. 



6 The Two Examples Commented 

Let us see the dynamics of interpretation and planning in the example dialogs we in- 
troduced in Section 3: we show how the system's image of the user and the plans per- 
formed evolve in the course of the two dialogs. In both cases, the user is initially 
thought to be in the ‘contemplation’ state, with a probability above the threshold. The 
system then adopts the plan ‘Clarify, decision is yours’, which suits the goal activated 
by this state. After move U3, the system feels a light emotion of ‘sorry-for’, which activates a plan of ‘empathy display’: “I know what you mean”. The inferred state of change does not differ from the previous one and the inferred valence is negative. Hence, a plan of ‘encourage evaluation of pros and cons of behaviour change’ is performed. The comments on Dialog 2 are similar. 



Dialog 1. The inferred state does not change 



User's state of change : Contemplation (p=25.65) 

S3: I may try to help you, but cannot take a decision in your place. 

Plan : Clarify: the decision is yours 

U3: I would like to do something but it's so difficult! 

Move interpretation : (Say U 'change is difficult') 
System emotion : light sorry-for 
User's state of change : Contemplation (p=24.01) 
Valence : negative 

S4: I know what you mean: diet is difficult to follow, particularly if you eat out of home because 
of work. But the effects may be very positive! 

Plan : Display empathy; Encourage the evaluation of pros and cons of behaviour change 

U4: Are you sure? 

Move interpretation : (Say U 'doubts pros') 
System emotion : sorry-for increases in intensity 
User's state of change : Contemplation (p=25.65) 
Valence : negative 

S5: Of course! Being in good shape can have beneficial effects on your appearance and health! 

Plan : Encourage the evaluation of pros and cons of behaviour change (goes on) 

Dialog 2. The inferred state is revised 

as in Dialog 1 

S4: I know, diet is difficult to follow, particularly if you eat out of home because of work. But the 
effects may be very positive! 

Plan : Encourage the evaluation of pros and cons of behaviour change. 

U4: I know! I definitely intend to change my habits. 

Move interpretation : (Say U 'intends to change') 
System emotion : light hope 
User's state of change : Preparation (p=24.35) 
Valence : positive 

S5: Good! Why, for instance, don't you try to avoid daily lunches in a bar or high-fat sandwiches? 

Plan : Display satisfaction; Encourage initial steps 

U5: Every day, you mean? I may try, but must find an alternative. 

Move interpretation : (Say U 'will try' & Ask U 'how') 
System emotion : hope increases in intensity 
User's state of change : Preparation (p=31.12) 
Valence : positive 

S6: You may bring something from home: some fresh fruits or vegetables, for instance. And 
have a cup of tea or a fruit juice. 

Plan : Encourage initial steps (goes on) 



7 Final Remarks and Related Work 

In this paper, we advocate that, to simulate affective dialogs, recognition of the emotional state, the mood and the attitude of the user should be integrated with an interpretation of these features, which helps in understanding what produced this affective state and how the subsequent course of the dialog should be adapted to this situation. As we said, so far we do not deal with the recognition problem (which, as is well known, may be based on analysis of visual, prosodic and linguistic data) but only with interpretation. Being aware that this interpretation is uncertain, we propose to adopt one of the typical models of uncertain reasoning: static and dynamic belief networks [1]. We demonstrate how we integrated these models in an information-state approach to dialog modelling. The reason why we prefer this approach to state-transition models is that affective advice-giving dialogs are highly dynamic: inference about the affective state of the user may evolve during the dialog, and the system has to revise its behaviour to adapt to the situation. To this aim, it has not only to 'recognize' that the user is (say) 'demoralized' or 'confident' about changing her behaviour, but also to infer the reason for this affective state: this will enable it to focus its repair strategy on the likely causes of the recognized situation. 
The body of experience to which our work is most closely related is Cassell and Bickmore’s research on ‘relational agents’ [4] and, more specifically, LAURA [2]. Although LAURA deals with advice about physical exercise rather than healthy eating, the background of our dialog simulators is the same as LAURA’s: in particular, Prochaska and colleagues’ state of change model. There are, however, several differences between our approaches and experiences. LAURA is aimed at providing long-term advice and follow-up: so, dialogs are repeated regularly to evaluate the effect of advice provided in the previous phases and to assess whether and how the state of change of the subject evolved as expected. In every state, though, little room is left for handling uncertainty, no user model is built and the system feedback is not tailored to the emotional state of the user. Consequently, the interactions are represented as state-transition models and follow a predefined template. While this approach is powerful and efficient in the situation envisaged in LAURA, it is not so when information states are not finite in principle or are so many that the complexity of the ATNs would become very high. It is also inconvenient when the motivation for selecting the next dialogue move relies only on part of the information available, rather than on the whole state. This occurs when the goal is to adapt the dialog to the user’s state and also to the agent’s emotional reaction. 

As we said, so far we recognize the state of the user from a predefined list of inputs: in ongoing research, we are studying how to infer it from linguistic analysis of free text. Other authors applied this method to infer 'personality traits' from analysis of emails [8] or combined prosodic and linguistic information to recognize the emotional state of the interlocutor [9], and the accuracy of recognition was rather high compared with the simplicity of the acquisition method. Parameters in our BN-based cognitive models have been set after a purely subjective evaluation. However, state of change models are applied to innumerable cases of health advice, and local distributions of people across the various stages may be obtained from health services: we therefore plan to refine the calibration of our models after analyzing experimental data from this kind of source. 

Acknowledgements. We thank Berardina Nadja De Carolis for helpful suggestions on 
Abstract. If we wish to implement dialogue systems which express emotion, dialogue corpora annotated for emotion would be a valuable resource. In order to develop such corpora we require a reliable annotation scheme. Here we describe an annotation scheme for emotion in dialogue using categorical labels to complement previous work using dimensional scales. The most difficult challenge in developing such a scheme is selecting the categories of emotions that will yield the most expressive yet reliable scheme. We apply a novel approach, using a genetic algorithm to identify the appropriate categories. 



There is increasing recognition of a need to incorporate an understanding 
of emotion into dialogue systems, since this understanding can greatly enhance 
their performance. For example, artificial communicative agents can be made to 
converse more naturally and appear more engaging by having characters behave 
emotionally [1,2]. 

For gaining an understanding of the relationship between emotion and dialo- 
gue, which would allow us to implement such systems, dialogue corpora annota- 
ted for emotional content would be a valuable resource. To develop such corpora, 
it is first necessary to develop an annotation scheme that yields rich and reliable 
results. In this paper we will describe such a scheme which employs the common 
approach of descriptive labels applied to segments of dialogue. 



1 Annotating Emotion in Transcribed Speech 

To date, much of the work studying the relationship between speech and emotion 
has concentrated on the prosodic properties of speech. Although much emotional 
expression is conveyed in speech’s acoustic realisation, it is equally important to 
understand how it is conveyed in the linguistic content. For example, if we are 
to generate speech that will be perceived to be emotional, it is not sufficient to 
construct a semantically and syntactically suitable, yet impassive utterance and 
then impose emotional intonation onto it. In order to be realistic, the content of 
speech must also express emotion to some degree. 
Besides complementing prosodic expressions of emotion in multi-modal dialogue systems, an understanding of the relationship between emotion and linguistic content will facilitate emotional dialogue systems where only the textual content is necessary or available. 

Since emotion is easier to identify with audio or visual context, it is temp- 
ting to annotate multi-modal dialogue and then draw conclusions about the 
relationship between the annotation and the linguistic content. However, since 
the annotation was not based solely on the information found in the linguistic 
content, the reliability of any conclusions drawn about this relationship must 
be doubted. In order to draw justifiable conclusions, the annotation must be 
applied to transcribed dialogue using a scheme evaluated in this context. 

2 Previous Studies 

Some studies have endeavoured to annotate emotion in dialogue. Some annotated 
general emotional states such as Positive and Negative [3,4], while others used 
specific labels for emotions [5]. 

Since part of the aim of the work using emotional states was to develop 
systems to detect the emotion of speech based on its acoustic features, it was 
necessary to make only very coarse distinctions. In order to study the relations- 
hip between emotion and speech it would be valuable to make finer distinctions, 
allowing more detailed analysis. Choosing specific emotion categories for an an- 
notation scheme is a difficult task. Laurence Devillers’ scheme [5] contained labels 
that were specific to the call centre application for which it was developed, and 
it is unclear whether these labels would be useful in a broader context. 

The scheme that we propose incorporates both these types of annotation, but 
differs from previous attempts by allowing annotators to make finer distinctions 
for emotional states, and by including labels that can most easily be identified 
within transcribed dialogue. 

3 Expressions of Emotion 

Since emotion is a very general term that may refer to a number of different 
things, we need to describe more specifically what we wish to annotate. Roddy 
Cowie distinguishes two types of descriptions of emotions for speech studies, 
cause-type and effect-type [6]. This is similar to Austin’s distinction between 
Illocutionary acts and their Perlocutionary effects [7]. 

Cause-type descriptions relate to the ‘internal states and external factors that 
caused a person’s speech to have particular characteristics’. Analysis of dialogue 
annotated for this type of phenomena would be useful for detecting a speaker’s 
emotion based on their speech. An example of a type of dialogue system that 
would benefit from this understanding would be those that wish to recognise user 
emotion and adapt their behaviour accordingly. If we annotate emotion in this 
form, we are attempting to guess the emotion that the speaker was experiencing 
whilst speaking. Since humans can successfully disguise their emotions or falsely 
exhibit others, this would be a very difficult task, and there is no way of knowing 
if our annotation is correct. 

Effect-type descriptions ‘describe what effect [the characteristics of speech] 
would be likely to have on a typical listener’. We could annotate this type of 
phenomenon by labelling speech for the emotion that we perceive as being ex- 
pressed, regardless of whether the speaker was experiencing, or even trying to 
express that emotion. An understanding of the relationship between the content 
of speech and the emotion that listeners perceive within it, would be especially 
useful for automatically generating emotional speech. Furthermore, since annot- 
ators are labelling the effect that the speech had on them rather than guessing 
the emotion experienced by the speaker, the annotation is more objective and 
intrinsically valid for each annotator. 

Because of the practical uses of effect-type annotation and also because it is 
likely to produce more valid and reliable results, our annotation scheme is used 
to label the perceived expression of emotion in dialogue. 



4 Developing an Annotation Scheme for Emotion in 
Dialogue 



Designing annotation schemes and evaluating their quality is rarely a trivial 
task. For subtle, rare and subjective phenomena the task becomes substantially 
more difficult. The two main challenges that must be overcome are - 

Reliability. The data that an annotation scheme produces must be shown to 
be reliable before its analysis is valid. Broadly, reliability reflects the clarity 
of the mapping of units of data onto categories which describe that data. In 
turn, this reflects the degree to which there is a shared understanding of the 
meaning of the phenomena in question. Reliability can be inferred from the 
level of agreement achieved by a number of coders labelling the same data. 
An overview of agreement statistics for discourse and dialogue coding can 
be found in [8] and [9]. 

Labelling emotion is frequently referred to as a ‘subjective’ process, meaning 
that the mapping of data to categories is abstruse. The greatest challenge 
when developing a scheme for emotion in dialogue is designing it in such 
a way that the data that it produces is sufficiently reliable. We tackle this 
problem by attempting to identify labels for emotions upon which annotators 
can agree. 

Coverage. Coverage refers to the proportion of units of data that have labels 
applied to them during the annotation process. Since obtaining dialogue 
corpora is difficult and annotation time can be expensive, it is desirable for 
the application of an annotation scheme to result in as much labelled data 
as possible. Since episodes of strong emotion are rare in spoken dialogue, 
obtaining sufficiently high rates of coverage is another challenge in developing 
an annotation scheme for this phenomenon. 
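
As a small worked illustration of the coverage measure, the sketch below computes the proportion of labelled units for each annotator and the average across annotators. The function names and the convention that an unlabelled utterance is represented by None are assumptions made for this example.

```python
def coverage(labels):
    """Proportion of units that received a label (None = left unlabelled)."""
    if not labels:
        raise ValueError("no units to annotate")
    return sum(1 for label in labels if label is not None) / len(labels)


def mean_coverage(annotations):
    """Average coverage over several annotators labelling the same dialogue."""
    if not annotations:
        raise ValueError("no annotators")
    return sum(coverage(labels) for labels in annotations) / len(annotations)


if __name__ == "__main__":
    # Invented toy data: two annotators, four utterances each.
    annotator_a = [None, "Anger", None, None]
    annotator_b = ["Fear", "Anger", None, None]
    print(coverage(annotator_a))
    print(mean_coverage([annotator_a, annotator_b]))
```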
In order to assess how difficult it would be to develop a scheme which overcame these challenges, we developed a trial scheme containing labels for emotions used in psychological research: Courage, Dejection, Sadness, Disgust, Aversion, Shame, Anger, Surprise, Guilt, Wonder, Hate, Affection, Happiness, Desire, Contempt and Fear [10]. Four annotators used this scheme to label a dialogue containing 400 utterances from our corpus of Cancer Patient/Nurse conversations [11]. The results were discouraging: on average only 16% of the utterances were labelled, with an overall agreement level of 0.17^. 

The disappointing results from this trial led us to adopt an alternative ap- 
proach to describing emotion, using abstract numerical scales. The results of this 
were much more encouraging. A description of this scheme was published in [13] 
and is summarised here. 

4.1 A Two Dimensional Annotation Scheme for Emotion in 
Dialogue 

This scheme is based on the notion that properties of emotions can be descri- 
bed as points on a numerical scale. Our approach is similar to the Activation- 
Evaluation space coding [14] used in the Feeltrace application [15] to track the 
emotion of speakers in multi-modal dialogue. In this application the perceived 
emotion of a speaker is tracked in two dimensions; Activation, which describes 
the degree to which that emotion inspires action in humans, and Evaluation 
which describes how positive or negative one might consider that emotion to be. 

In order to produce a practical and usable scheme for dialogue we adap- 
ted this approach in a number of ways. Firstly, dialogues are segmented into 
utterances and values are applied to each individual utterance. One benefit of 
employing utterances is that the majority of other annotation schemes also use 
these as their basic unit, and aligning our scheme with others will allow us to 
make comparisons between layers of annotation. Also, since utterances are a con- 
venient unit for the generation of dialogue, labelling at this granularity makes 
the results of analysis easier to apply. Although expressions of emotion do not 
always align with utterance boundaries, asking annotators to segment dialogue 
into smaller units would increase the complexity of the coding process, especially 
considering that the boundaries of emotional expression are rarely clear. 

When applying our scheme, annotators are restricted to applying one pair 
of values for each utterance. Although it is possible to express more than one 
emotion within a single utterance it is relatively rare for speakers to do so. 
In an experiment that we conducted, in which annotators labelled emotions in 
a dialogue using labels of their own choosing, only around 3% of utterances
required more than one label. Allowing annotators to apply more than one pair 
of values per utterance increases the complexity of the annotation process for 
little benefit. 

^ Agreement was measured using Krippendorff’s alpha statistic [12]. The value should 
be interpreted as a level of agreement between 0 and 1, where 1 represents perfect 
agreement and 0 suggests that the coders did not understand the task and behaved 
randomly. 
The next adaptation was that instead of Activation, our scheme used an
Intensity dimension. Whereas Activation refers to the arousal of the person ex- 
periencing the emotion, it is not clear how this relates to the perceived expression 
of emotion within speech. Intensity describes the overall level of expression wit- 
hin an utterance, and this conceptually simpler dimension should be more easily 
understood and applied by annotators. It is also likely that intensity will serve 
as a more useful parameter for dialogue generation systems in which the level of 
emotional expression can be adjusted. 

Finally, we wished to bestow some meaning on the values applied by the 
annotators. During the development of Feeltrace, it was recognised that coders 
performed more reliably when the locations of specific emotions were placed on 
their two dimensional plane. The introduction of reference points implies some 
meaning to the values within the dimensional space. We also suggest that we 
need to introduce some notion of scale to the dimensions, without which, values 
away from the reference points become difficult to interpret. 

Reference points and scale are introduced into our two dimensional scale 
implicitly by dividing the continuous dimensional space into Likert-scale style 
sets of discrete values. Level can be a value from 0 to 4 and Evaluation from -3 to
+3. This allows us to describe the meaning of each value and give examples of
circumstances in which that value should be used (see section 6). Making the 
meaning of each value explicit also should reduce the subjectivity in the coding 
process. 



4.2 Why the Need for a Categorical Scheme for Annotating 
Emotion in Dialogue? 

The dimensional model was used to create our annotation scheme because of 
the difficulty we observed in developing a categorical scheme which would elicit 
satisfactory reliability and coverage. While we believed that agreement could 
be increased by concentrating on a selection of emotions that could be reliably 
identified by annotators, this would reduce coverage to intolerable levels. 

The prospect of a categorical annotation scheme for expression of emotion 
remains attractive. If we can understand how a small number of emotions are 
expressed in speech, this could be exploited to make artificial communicative 
agents more realistic. There are cases in which it is possible to identify specific 
emotions, ‘surprise’ being a common example. It is a shame that even though 
annotators may agree that an utterance expresses a specific emotion they are 
limited to using numeric values to describe it. 

Now that we have a scheme that allows annotators to label every utterance 
for its emotional expression regardless of how subtle it may be, we may augment 
that scheme with nominal labels for emotions that can be reliably identified 
within transcribed dialogue. 
4.3 Developing the Categorical Annotation Scheme

A categorical annotation scheme for emotion may be considered to be a collection 
of labels for emotions which annotators may apply to individual utterances. 
A wide range of different lists of emotions has been proposed in psychological
research, from the popular big six (anger, fear, happiness, sadness, surprise,
and disgust) to larger lists of emotion-related states. Choosing an appropriate
list for an annotation scheme would seem a daunting process. However, we can
be guided by the proposed application of the analysis of the annotated data
(developing an understanding of how to express emotion in speech). We might
consider that the most useful emotions to annotate would be those that can be
most easily and reliably identified by humans.

To derive a suitable list of emotions, an experiment was conducted in which 
four annotators labelled three dialogues each of around 400 utterances. Instead 
of being prescribed a set of labels, they were allowed to use any labels that they 
believed appropriately described the emotions expressed. They were also allowed 
to use more than one label per utterance if necessary. The results looked something
like this:

N1: Right, how are you this morning []

P2: Not too brilliant [despondency] 

N3: No? What’s the problem [interest] 

P4: I’ve had some right bad breathing do’s [anxiety] 



Over 100 different labels were invented by the annotators. Permitting them 
to choose labels for individual utterances resulted in the chosen labels being very 
fine grained and specific (e.g. Disgruntled, Reflective, Resolute). Some of 
them were inappropriate for our annotation scheme because what they describe is 
not clearly an emotion (e.g. Friendly, Confirmation). In order to arrive at our 
final list of emotions we can group these fine-grain labels into broader categories 
and ignore the ones that are not appropriate or cause too much disagreement. For 
example, if the labels Anxiety, Concern and Worry cause disagreement between 
annotators we could achieve greater reliability by combining the three into a 
single label. 
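
The effect of grouping and ignoring labels can be illustrated with a minimal sketch. The function names, the toy annotations and the use of a raw match rate (rather than a chance-corrected statistic) are assumptions made purely for illustration.

```python
def apply_grouping(annotations, grouping):
    """Rewrite each coder's labels using a grouping.

    grouping maps a fine-grained label to its group name; a label mapped to
    None is ignored, and a label missing from the mapping is kept unchanged.
    """
    grouped = []
    for labels in annotations:
        mapped = (grouping.get(label, label) for label in labels)
        grouped.append([m for m in mapped if m is not None])
    return grouped


def match_rate(coder_a, coder_b):
    """Fraction of utterances on which two coders chose the same label set."""
    matches = sum(set(a) == set(b) for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


# Toy data: one label list per utterance for each of two coders.
coder_a = [["Anxiety"], ["Concern"], ["Friendly"], ["Surprise"]]
coder_b = [["Worry"], ["Worry"], [], ["Surprise"]]

# Merge Anxiety and Concern into Worry, and ignore Friendly.
grouping = {"Anxiety": "Worry", "Concern": "Worry", "Friendly": None}

print(match_rate(coder_a, coder_b))  # 0.25
print(match_rate(apply_grouping(coder_a, grouping),
                 apply_grouping(coder_b, grouping)))  # 1.0
```

The raw match rate is used here only to keep the example short; it does not correct for chance agreement.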

This grouping and ignoring can be considered as a search procedure, where we 
are attempting to find an arrangement of labels into groups which results in the 
highest level of agreement. An appropriate way of finding suitable arrangements 
is to employ the artificial intelligence heuristic search procedure known as a 
genetic algorithm. 

Genetic algorithms can explore very large search spaces by applying a ‘sur- 
vival of the fittest’ evolution process to candidate solutions to a problem. Solu- 
tions to our problem consist of an arrangement of labels into groups, including 
an ignore-me group for labels that will not appear in the final scheme. Our 
algorithm generates a random population of candidate solutions and at each ite- 
ration, discards a proportion of the weakest solutions then refills the population 
by combining and mutating the better solutions. 
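
The loop just described can be sketched as follows. This is a deliberately simple, single-objective genetic algorithm, not SPEA2; the solution encoding (a mapping from label to group number, with group 0 as the ignore-me group), the uniform crossover, the mutation rate and all function names are assumptions made for illustration.

```python
import random

IGNORE = 0  # group 0 plays the role of the ignore-me group


def random_solution(labels, n_groups, rng):
    """Assign every label to a random group (0 means ignored)."""
    return {label: rng.randrange(n_groups + 1) for label in labels}


def crossover(a, b, rng):
    """Uniform crossover: take each label's group from one parent at random."""
    return {label: (a if rng.random() < 0.5 else b)[label] for label in a}


def mutate(solution, n_groups, rng, rate=0.1):
    """Move each label to a random group with probability `rate`."""
    return {
        label: rng.randrange(n_groups + 1) if rng.random() < rate else group
        for label, group in solution.items()
    }


def evolve(labels, fitness, n_groups=3, pop_size=20, generations=30,
           keep=0.5, seed=0):
    """Return the best solution found and the best fitness per generation."""
    rng = random.Random(seed)
    population = [random_solution(labels, n_groups, rng)
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Discard the weakest solutions, then refill the population by
        # combining and mutating the survivors.
        survivors = population[: max(2, int(pop_size * keep))]
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng),
                                   n_groups, rng))
        population = survivors + children
    return max(population, key=fitness), history


# Toy fitness: prefer solutions that ignore as few labels as possible.
toy_labels = ["Anxiety", "Worry", "Concern", "Joy"]
best, history = evolve(
    toy_labels, lambda s: sum(group != IGNORE for group in s.values()))
```

Because the survivors are carried over unchanged, the best fitness in `history` never decreases from one generation to the next.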
We applied the state of the art genetic algorithm SPEA2 [16], which simply
required us to write a function that would return a value denoting the fitness of 
any given candidate solution. The simplest function for this purpose could return 
the level of agreement that would be achieved, if instead of using the individual 
labels, annotators used a single label, common to all members of the group in 
which the label was placed. Unfortunately this simplistic function encouraged 
the algorithm to ignore lots of labels and over-zealously group other labels into 
very large collections. 

In order to yield more useful results, the algorithm was asked to satisfy
multiple objectives:

Agreement. Solutions were rewarded for resulting in high levels of agreement 
between annotators. 

Coverage. Higher scores were awarded when greater numbers of utterances 
were labelled using the solution’s scheme. This discouraged the algorithm 
from ignoring too many labels. 

Entropy. Solutions which balanced the frequency with which each group of la-
bels was used were preferred to those with an unequal distribution. This
discouraged the algorithm from forming super-groups by bundling many la-
bels together.

Coherence. In an attempt to balance the distribution of groups to satisfy the 
entropy objective, the algorithm tended to merge groups of unrelated emo- 
tions. To avoid this, each individual emotion was labelled using the dimen- 
sional scale described in section 4.1 and this was used in the calculation of 
a group’s coherence. A solution’s coherence score is penalised for placing 
disparate labels within a group. 
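
One plausible way of scoring three of these objectives is sketched below. The exact formulas used in the experiment are not given in the text, so the definitions here (fraction of label uses kept, normalised Shannon entropy of group usage, and negative mean distance from the group centroid in the Level/Evaluation space) are assumptions, as are the function names and the toy values.

```python
import math
from collections import Counter


def coverage(label_uses, solution):
    """Fraction of label uses whose label is not ignored (mapped to None)."""
    kept = sum(1 for label in label_uses if solution.get(label) is not None)
    return kept / len(label_uses)


def entropy(label_uses, solution):
    """Normalised Shannon entropy of group usage (1.0 = perfectly balanced)."""
    counts = Counter(solution[label] for label in label_uses
                     if solution.get(label) is not None)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))


def coherence(solution, positions):
    """Negative mean distance of labels from their group centroid.

    positions maps each label to a (level, evaluation) pair; 0 is the best
    possible score, and more negative values mean more disparate groups.
    """
    groups = {}
    for label, group in solution.items():
        if group is not None:
            groups.setdefault(group, []).append(positions[label])
    spread, n = 0.0, 0
    for points in groups.values():
        centroid = (sum(p[0] for p in points) / len(points),
                    sum(p[1] for p in points) / len(points))
        spread += sum(math.dist(p, centroid) for p in points)
        n += len(points)
    return -spread / n if n else 0.0


# Toy data: every use of a label by any annotator, and one candidate solution.
uses = ["Anxiety", "Worry", "Joy", "Joy", "Friendly"]
solution = {"Anxiety": "Worry", "Worry": "Worry",
            "Joy": "Happiness", "Friendly": None}
print(coverage(uses, solution))  # 0.8
```

Each function returns a value to be maximised, so the scores can be passed directly to a multi-objective search.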

When searching for high quality solutions using more than one objective,
we are attempting to find the Pareto optimal solutions: those
solutions for which there is no other solution that is superior for every objective.
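
A minimal sketch of filtering a set of scored solutions down to its Pareto optimal members follows. It uses the common dominance relation (at least as good on every objective and strictly better on at least one), which is an assumption and is slightly stricter than the wording above; the function names and scores are illustrative.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Keep the score tuples that no other tuple dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Each tuple is (agreement, coverage) for one candidate grouping.
candidates = [(0.4, 0.9), (0.5, 0.5), (0.3, 0.8), (0.6, 0.2)]
print(pareto_front(candidates))  # [(0.4, 0.9), (0.5, 0.5), (0.6, 0.2)]
```

Here (0.3, 0.8) is discarded because (0.4, 0.9) is better on both objectives, while the remaining three represent different trade-offs.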

For each arrangement into groups we can derive a list of emotions by selecting 
a label for each group which adequately describes the emotions within it. The 
choice as to which label would best describe the group was a personal decision 
but for most cases the appropriate label was obvious. For example if it was 
suggested that we group Anxiety, Worry, Concern and Trepidation we 
may decide to use the label Worry to describe that group. 

5 The Categorical Annotation Schemes for Emotion in 
Dialogue 

Rather than selecting one definitive list of labels to be used in our anno-
tation scheme, we have chosen three, each with fewer, coarser grained categories
than the last. These are shown in Fig. 1 and are hereafter referred to as the
fine-grained, medium-grained and coarse-grained schemes.

The labels that constitute the fine-grained scheme are those that
appeared most frequently in the Pareto optimal set of solutions. It also includes
Affection and Anger since for the experiment described in section 4 these two
emotions caused considerably less disagreement than the others. The medium-
grained and coarse-grained schemes were created by conflating categories that
were sometimes grouped by the genetic algorithm. Specifically, Worry and Fear
became Worry; Contentment and Joy became Happiness; Dislike and Misery
became Sadness; Positivity, Happiness and Affection became Happiness
and finally Sadness and Anger became Anger. Frustration and Positivity
were removed from the fine and medium grained schemes respectively as part of
the reduction process.



Fine          Medium       Coarse

Anger         Anger        Happiness
Affection     Affection    Sadness
Contentment   Hope         Worry
Dislike       Happiness    Hope
Frustration   Positivity   Surprise
Fear          Sadness
Hope          Surprise
Joy           Worry
Misery
Positivity
Surprise
Worry



Fig. 1. The three candidate collections of labels 



5.1 Evaluating the Annotation Schemes 

We next assessed the quality of the proposed annotation schemes by measuring 
the reliability of the data that they produce using an inter-rater agreement 
measure, namely Krippendorff’s Alpha^ [12]. 

For each scheme eleven annotators (10 for the medium-grained scheme) an- 
notated a dialogue of 52 utterances. They followed written instructions descri- 
bing how to perform the annotation with a definition of each label (see section 
6). This dialogue was distilled from anonymised extracts of our corpus. While 
the results of Alpha are not a function of the data being annotated, it was im- 
portant to encourage the use of as many different labels as possible so that the 
overall reliability of the scheme can be inferred from the agreement results. For 

^ Since these schemes allow more than one label to be applied to each utterance, neither
Kappa, nor Alpha in its original form will correctly calculate agreement between
annotators applying them. Together with Klaus Krippendorff, we have designed and
implemented an extension of Alpha which works in these circumstances. The details
of this measure are yet to be disseminated.

this reason the extracts that made up the dialogue were those that contained a
range of emotions.

The results of the Alpha test on each of the schemes were as follows:

Fine Grained α = 0.329

Medium Grained α = 0.438

Coarse Grained α = 0.443

Reliability is inferred from the level of agreement observed in an annotation
based on the degree to which we are willing to rely on imperfect data [17]. It
has become common practice in computational linguistics to measure reliability
against Krippendorff’s criterion, with schemes that surpass agreement figures of
0.667 being considered ‘reliable’ and worthy of use. This is an unfortunate and
dangerous misinterpretation of Krippendorff’s work.

The appropriate way in which reliability should be inferred is that the at-
tained level of agreement should dictate the applications to which the resulting
annotated data can be applied. Although the agreement shown for our schemes
is not strong, this does not mean that the schemes should not be used, only that
any conclusions drawn from the resulting data must be based on strong evidence
to counter the imperfections in that data. It is likely that the reliability that
these schemes display could be increased by making improvements to the coding
manual and by training the annotators.



6 A Closer Look at Our Final Scheme 

As described previously, the collection of labels that has been chosen is used
to supplement the numerical scheme for annotating emotion in dialogue. The
procedure for annotating dialogue, segmented into utterances, using this hybrid
scheme is as follows:

1. For each utterance label the overall level and polarity of the expression of 
emotion using the following guidelines - 

Level 

0 No emotion or it is impossible to tell - “So how are you?” 

1 Not totally lacking in emotion, (a hint of) - “I suppose so” 

2 low level, but apparent - “I’m not feeling too great” 

3 Clear expression of emotion - “Oh she’s annoying that girl”

4 Strong expression of emotion - “I can’t bear to talk about it”

Evaluation

-3 Wholly/Strong negative - “It was the most awful feeling”

-2 Clearly negative - “He tries, but he keeps messing it up”

-1 Perhaps negative (but not positive) - “You know, the stupid one”

neutral Neutral or impossible to tell - “He’s moved to Blackburn”

+1 Perhaps positive (but not negative) - “Oh yes, that new show”

+2 Clearly positive - “That’s a nice view”

+3 Wholly/Strong positive - “Oh that is wonderful news” 
2. If one or more of the following labels apply to the expression of emotion in the
utterance then add those labels to the annotation. If none of the labels apply
then leave the utterance unlabelled. (The actual emotions to list depend on
the chosen granularity of the scheme; all descriptions are given below.)

Anger - The speaker expresses that a certain situation or person has upset 
them such that they feel passionately about it. 

Affection - The speaker expresses a liking or love for something. 

Hope - The speaker expresses emotion due to the anticipation of something 
good happening. 

Happiness - The speaker expresses a generally positive feeling. 

Positivity - The speaker expresses a wish to avoid sadness caused by a
situation. This includes the likes of bravery, desire and determination.

Sadness - The speaker expresses that a situation, person, memory etc. is 
making them unhappy without necessarily being motivated to do anything 
about it. 

Surprise - The speaker expresses that something unexpected has affected 
them. 

Worry - The speaker expresses that uncertainty about the future is negati- 
vely affecting them. 

Contentment - The speaker expresses satisfaction or gratification. 

Dislike - The speaker expresses disapproval or aversion toward a situation, 
person or object without necessarily being motivated to do anything about 
it. 

Frustration - The speaker expresses that their inability to achieve some- 
thing is upsetting them. 

Fear - The speaker expresses a disposition caused by anticipation of some- 
thing bad happening. 

Joy - The speaker expresses a positive feeling which they intend to enjoy. 

Misery - The speaker expresses that a situation or person is making them 
unhappy without necessarily being motivated to do anything about it. 
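
The two-step procedure above can be captured in a small data structure, sketched here under stated assumptions: the class and field names are invented for illustration, the neutral Evaluation value is stored as 0, and the label sets are those listed for the three schemes in Fig. 1.

```python
from dataclasses import dataclass, field

# Label sets for each granularity, as listed in Fig. 1.
SCHEMES = {
    "fine": {"Anger", "Affection", "Contentment", "Dislike", "Frustration",
             "Fear", "Hope", "Joy", "Misery", "Positivity", "Surprise",
             "Worry"},
    "medium": {"Anger", "Affection", "Hope", "Happiness", "Positivity",
               "Sadness", "Surprise", "Worry"},
    "coarse": {"Happiness", "Sadness", "Worry", "Hope", "Surprise"},
}


@dataclass
class UtteranceAnnotation:
    text: str
    level: int                  # 0 (no emotion) to 4 (strong expression)
    evaluation: int             # -3 (wholly negative) to +3 (wholly positive)
    labels: set = field(default_factory=set)  # may be empty (unlabelled)

    def validate(self, scheme="coarse"):
        """Check the values against the scales and the chosen label scheme."""
        if not 0 <= self.level <= 4:
            raise ValueError(f"level out of range: {self.level}")
        if not -3 <= self.evaluation <= 3:
            raise ValueError(f"evaluation out of range: {self.evaluation}")
        unknown = set(self.labels) - SCHEMES[scheme]
        if unknown:
            raise ValueError(f"labels not in {scheme} scheme: {sorted(unknown)}")
        return self


# Level 2 follows the guideline example; the evaluation and label are
# illustrative choices, not values taken from the corpus.
annotation = UtteranceAnnotation(
    "I'm not feeling too great", level=2, evaluation=-2, labels={"Sadness"}
).validate("coarse")
```

Keeping the label set optional reflects step 2: an utterance always receives numeric values but only receives a label when one clearly applies.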



7 Conclusion 

In this paper we have described an annotation scheme for labelling expressions 
of emotion in dialogue. We recognised that reliably identifying emotion is a 
difficult task, but by combining a categorical annotation scheme with another 
using dimensional scales we could include a select group of labels which can most 
easily be identified by human annotators. 
We proposed three different schemes, each with a different level of granularity.
Although the overall level of agreement for each of these schemes was well below 
ideal, it was evident that the finer the distinctions between different emotions, the 
more difficult it was for annotators to agree. Under these circumstances, having 
more than one scheme allows us to choose a scheme which is most appropriate 
for each task, based on the level of granularity required and the degree to which 
we are willing to rely on imperfect data. 

By annotating dialogue corpora for emotional expression we hope that it 
is possible to gain an understanding of the factors that contribute to listeners 
believing that emotion is being expressed in speech. This understanding may 
be applicable to a range of tasks, and we identify the generation of emotional 
speech by artificial communicative agents as a potential beneficiary. 
Abstract. Audio/visual speech, in the form of labial movement and facial ex-
pression data, was utilized in order to semi-automatically build a new Italian
expressive and emotive talking head capable of believable and emotional be-
havior. The methodology, the procedures and the specific software tools util-
ized for this purpose will be described together with some implementation exam-
ples.



1 Introduction* 

It is well documented by specific workshops and conferences (AVSP,
LREC), European (FP6, Multimodal/Multisensorial Communication, R&D) and Inter-
national (COCOSDA, ISLE, LDC, MITRE) framework activities, and by various
questionnaires (see ISLE and NIMM, ELRA, COCOSDA, LDC, TalkBank, Dagstuhl
Seminar [1]) that data-driven procedures for building more natural and expressive
talking heads are becoming popular and successful.

The knowledge that acoustic and visual signals simultaneously convey lin-
guistic, extralinguistic and paralinguistic information is widespread in the speech
communication community, and it constitutes the basis for this work. The data-driven
procedure utilized to build a new Italian talking head, described in this work, has
in fact been directly driven by audio/visual data, in the form of labial movement and
facial expression data, that were physically extracted by an automatic optotracking
movement analyzer for 3D kinematics data acquisition called ELITE [2].



* Part of this work has been sponsored by COMMEDIA (COMunicazione Multimodale di
Emozioni e Discorso in Italiano con Agente animato virtuale, CNR Project C00AA71), PF-
STAR (Preparing Future multiSensorial inTerAction Research, European Project IST-2001-
37599, http://pfstar.itc.it ) and TICCA (Tecnologie cognitive per l'Interazione e la
Cooperazione Con Agenti artificiali, joint “CNR - Provincia Autonoma Trentina” Project).
1.1 Audio/Visual Acquisition Environment

ELITE is a fully automatic movement analyzer for 3D kinematics data acquisition
that provides 3D coordinate reconstruction, starting from the 2D perspective pro-
jections, by means of a stereophotogrammetric procedure which allows free posi-
tioning of the TV cameras. The 3D data coordinates are then used to calculate and
evaluate the parameters described hereinafter. Two different configurations have been
adopted for articulatory data collection: the first one, specifically designed for the
analysis of labial movements, considers a simple scheme with only 8 reflecting mark-
ers (bigger grey markers in Figure 1a), while the second, adapted to the analysis of
expressive and emotive speech, utilizes the full set of 28 markers.
Fig. 1. Position of reflecting markers and reference planes for the articulatory movement data
collection (a); speech signal and time evolution of some labial kinematic parameters (LO, LR,
ULP, LLP, UL, LL, ASYMX and ASYMY, see text) for the sequence /’aba/ (b).



All the movements of the 8 or 28 markers, depending on the adopted acquisition
pattern, are recorded and collected, together with their velocity and acceleration, si-
multaneously with the co-produced speech, which is usually segmented and analyzed
by means of PRAAT [3], which also computes intensity, duration, spectrograms, for-
mants, pitch-synchronous F0, and various voice quality parameters in the case of
emotive and expressive speech [4-5]. As for the analysis of the labial movements, the
most common parameters selected to quantify the labial configuration modifications,
as illustrated in Figure 1b for some of them, are introduced in Table 1:
Table 1. Meaning of some of the most common chosen articulatory parameters.



• Lip Opening (LO), calculated as the distance between markers placed on the central points of the
upper and lower lip vermillion borders [d(m2,m3)]; this parameter correlates with the HIGH-LOW
phonetic dimension.

• Lip Rounding (LR), corresponding to the distance between the left and right corners of the lips
[d(m4,m5)], which correlates with the ROUNDED-UNROUNDED phonetic dimension: negative
values correspond to lip spreading.

• Anterior/posterior movements (Protrusion) of Upper Lip and Lower Lip (ULP and LLP), calcu-
lated as the distance between the marker placed on the central point of either the upper or lower lip
and the frontal plane A containing the line crossing the markers placed on the lobes of the ears and
perpendicular to the D plane [d(m2,A), d(m3,A)]. These parameters correlate with the feature
PROTRUDED-RETRACTED: negative values quantify lip retraction.

• Upper and Lower Lip vertical displacements (UL, LL), calculated as the distance between the mark-
er placed on the central point of either the upper or lower lip and the transversal plane D passing
through the tip of the nose and the markers on the ear lobes [d(m2,D), d(m3,D)]. Hence, positive
values correspond to a reduction of the displacement of the markers from the D plane. As stated before,
these parameters are normalized in relation to the lip resting position.
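
As an illustration of how such parameters could be derived from reconstructed 3D marker coordinates, the following sketch computes raw (un-normalised) values of LO, LR, ULP and LLP. The function names, the dictionary layout and the representation of plane A by a point and a normal vector are assumptions; normalisation to the resting position and the sign conventions of Table 1 are deliberately left out.

```python
import math


def point_plane_distance(point, plane_point, normal):
    """Signed distance from a 3D point to a plane given by a point and a normal."""
    norm = math.sqrt(sum(c * c for c in normal))
    return sum((p - q) * n
               for p, q, n in zip(point, plane_point, normal)) / norm


def lip_parameters(markers, plane_a):
    """Raw lip parameters from marker positions.

    markers: dict with 3D coordinates for 'm2' (upper lip), 'm3' (lower lip),
             'm4' and 'm5' (lip corners).
    plane_a: (point_on_plane, normal_vector) describing the frontal plane A.
    """
    plane_point, normal = plane_a
    return {
        "LO": math.dist(markers["m2"], markers["m3"]),   # d(m2, m3)
        "LR": math.dist(markers["m4"], markers["m5"]),   # d(m4, m5)
        "ULP": point_plane_distance(markers["m2"], plane_point, normal),
        "LLP": point_plane_distance(markers["m3"], plane_point, normal),
    }


# Toy coordinates in millimetres; plane A lies 5 mm behind the lips.
frame = {"m2": (0, 1, 0), "m3": (0, -1, 0), "m4": (-2, 0, 0), "m5": (2, 0, 0)}
params = lip_parameters(frame, ((0, 0, -5), (0, 0, 1)))
print(params["LO"], params["LR"], params["ULP"])  # 2.0 4.0 5.0
```

Applying the same function to a resting-position frame and comparing the two results would give rest-relative values.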



2 Data-Driven Methodology and Tools 

As explained in [6-8], several Audio/Visual corpora were used to train our MPEG-4
[9] standard talking head called LUCIA [10], speaking with an Italian version of
FESTIVAL TTS [11].

2.1 Model Estimation 

The parameter estimation procedure for LUCIA’s model is based on a least-squares
phoneme-oriented error minimization scheme with a strong convergence property,
between real articulatory data Y(n) and modeled curves F(n) for the whole set of R
stimuli belonging to the same phoneme set:

where F(n) is generated by a modified version of the Cohen-Massaro co-articulation 
model [13] as introduced in [6-7]. Even if the number of parameters to be optimized is 
rather high, the size of the data corpus is large enough to allow a meaningful estima- 
tion, but, due to the presence of several local minima, the optimization process has to 
be manually controlled in order to assist the algorithm convergence. The mean total 
error between real and simulated trajectories for the whole set of parameters is lower 
than 0.3 mm in the case of bilabial and labiodental consonants in the /a/ and /i/ con- 
texts [14, p. 63]. 
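
As a sketch only, assuming that the error is the plain sum of squared differences over all samples of all R stimuli, the least-squares criterion described above could be written as follows. The stimulus index r and the per-stimulus sample count N_r are notation introduced here for illustration.

```latex
e = \sum_{r=1}^{R} \sum_{n=1}^{N_r} \bigl( Y_r(n) - F_r(n) \bigr)^2
```

Minimising e over the model parameters would then yield the coefficients used to generate the modeled curves F(n).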
2.2 MPEG4 Animation



In MPEG-4 [9], FDPs (Facial Definition Parameters) define the shape of the model
while FAPs (Facial Animation Parameters) define the facial actions. Given the shape
of the model, the animation is obtained by specifying the FAP-stream, that is, the
values of the FAPs for each frame (see Figure 2). In a FAP-stream, each frame has two lines of
parameters. In the first line the activation of a particular marker is indicated (0, 1)
while in the second, the target values, in terms of differences from the previous ones,
are stored.
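
The two-line frame layout can be illustrated with a minimal parsing sketch. It assumes whitespace-separated integers, one value on the second line for each parameter flagged 1 on the first line, and parameters numbered from 1; real FAP files may carry additional fields (for example a header or a frame number) that are not handled here, and the function names are invented for illustration.

```python
def parse_fap_frame(mask_line, value_line):
    """Return {parameter_number: value} for the active parameters of one frame."""
    mask = [int(token) for token in mask_line.split()]
    values = [int(token) for token in value_line.split()]
    active = [index + 1 for index, flag in enumerate(mask) if flag == 1]
    if len(values) != len(active):
        raise ValueError("number of values does not match the activation mask")
    return dict(zip(active, values))


def apply_frame(previous, frame):
    """Add a frame's values, stored as differences, to the previous state."""
    current = dict(previous)
    for parameter, delta in frame.items():
        current[parameter] = current.get(parameter, 0) + delta
    return current


frame = parse_fap_frame("0 1 0 1", "10 -5")
print(frame)                       # {2: 10, 4: -5}
print(apply_frame({2: 3}, frame))  # {2: 13, 4: -5}
```

Storing only the active parameters keeps each frame compact when most of the face is not moving.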

In our case, the model uses a pseudo-muscular approach, in which muscle contrac-
tions are obtained through the deformation of the polygonal mesh around feature
points that correspond to skin muscle attachments.
Fig. 2. The FAP stream.



Each feature point follows MPEG4 specifications where a FAP corresponds to a 
minimal facial action. When a FAP is activated (i.e. when its intensity is not null) the 
feature point on which the FAP acts is moved in the direction signalled by the FAP it- 
self (up, down, left, right, etc). 

Using the pseudo-muscular approach, the facial model’s points within the region 
of this particular feature point get deformed. A facial expression is characterised not 
only by the muscular contraction that gives rise to it, but also by an intensity and a du- 
ration. The intensity factor is rendered by specifying an intensity for every FAP. The 
temporal factor is modelled by three parameters: onset, apex and offset [15]. 

The FAP-stream needed to animate a FAE (Facial Animation Engine) could be
completely synthesized by using a specific animation model, such as the co-
articulation one used in LUCIA, or it could be reconstructed on the basis of real data
captured by optotracking hardware, such as ELITE.
2.3 Tools: “INTERFACE”

In order to speed up the procedure for building our talking head, an integrated
software tool called INTERFACE, whose block diagram is illustrated in Figure 3, was de-
signed and implemented in Matlab©. INTERFACE simplifies and automates many of
the operations needed for that purpose.

The whole processing block is designed to prepare the correct WAV and
FAP files needed for the animation engines, both in the sense of building up the en-
gines and of actually creating the WAV and FAP files needed for the final anima-
tion. The final animation, in fact, can be completely synthesized starting from an input
emotionally tagged text, by the use of our animation engine [13], or it can be reproduced
by using the data, relative to the specific movements of the markers positioned on hu-
man subjects, extracted by ELITE.



[Figure 3 block diagram. Blocks: Processing (Track, Optimize, IFDCIN, MAVIS), XML editing, expml2fap, FESTIVAL, FacePlayer.]





Fig. 3. INTERFACE block diagram (see text for details) 



INTERFACE handles three types of input data from which the corresponding
MPEG4 compliant FAP-stream can be created:

• low-level data, represented by the marker trajectories captured by ELITE;
these data are processed by 4 programs:

    • “TRACK”, which defines the pattern utilized for acquisition and imple-
    ments the 3D trajectory reconstruction procedure;

    • “OPTIMIZE”, which trains the modified co-articulation model [13] utilized
    to move the lips of GRETA [6] and LUCIA [10], our two current talking
    heads under development;

    • “IFDCIN”, which allows the definition of the articulatory parameters in rela-
    tion to marker positions, and which is also a DB manager for all the files
    used in the optimization stages;

    • “MAVIS” (Multiple Articulator VISualizer, written by Mark Tiede of
    ATR Research Laboratories [16]), which allows different visualizations of ar-
    ticulatory signals;

• symbolic high-level XML text data, processed by:

    • “XML-EDITING”, an emotion-specific XML editor for emotion-tagged
    text to be used in TTS and Facial Animation output;

    • “EXPML2FAP”, the main core animation tool, which transforms the tagged
    input text into corresponding WAV and FAP files, where the former are synthe-
    sized by FESTIVAL and the latter, which are needed to animate the MPEG4
    engines GRETA or LUCIA [11], are produced by the optimized animation
    model (designed by the use of OPTIMIZE);

• single low-level FAPs, created by:

    • “XML-EDITING” (see above);

  and edited by

    • “FACEPLAYER”, a direct low-level manual control of a single FAP (or group
    of FAPs); in other words, FACEPLAYER renders what happens, in GRETA
    and LUCIA, while acting on MPEG4 FAP points, for useful immediate
    feedback.

The TrackLab software originally supplied by BTS© [17] for ELITE is not reli-
able in reconstructing 3D trajectories when there are many very quickly varying
markers close to each other, as usually happens in the articulatory study of facial
expressions. The TRACK MatLab© software was, in fact, developed with the aim of
avoiding marker tracking errors that force a long manual post-processing stage and
also a compulsory stage of marker identification in the initial frame for each
camera used. TRACK is quite effective in terms of trajectory reconstruction and process-
ing speed, obtaining a very high score in marker identification and reconstruction by
means of a reliable adaptive processing. Moreover, only a single manual intervention
for creating the reference tracking model (pattern of markers) is needed for all the
files acquired in the same working session. TRACK, in fact, tries to guess the possi-
ble target pattern of markers, as illustrated in Figure 4, and the user must only accept a
proposed association or modify a wrong one if needed; it then runs automatically on
all files acquired in the same session.

Moreover, we give the user the possibility of independently configuring the markers
and also a standard FAP-MPEG. The actual configuration of the FAPs is described in
an initialization file and can be easily changed. The assignment of markers to the MPEG
standard points is made with the context menu, as illustrated in Figure 5.
Fig. 4. Definition of the reference model. TRACK’s marker positions and names are associated
with those corresponding to the real case.




[Screenshot: INTERFACE Marker-FAP configuration manager. Example entry: Marker n. 20 associated to FAP n. 54 (Horizontal displacement of right outer lip corner) and n. 60 (Vertical displacement of right outer lip corner).]



Fig. 5. Marker MPEG-FAP association with TRACK’s reference model. The MPEG refer-
ence points (on the left) are associated with TRACK’s marker positions (on the right).
In other words, as illustrated in the examples shown in Figure 6, for LUCIA,
TRACK allows 3D real data driven animation of a talking face, converting the ELITE 
trajectories into standard MPEG4 data and eventually it allows, if necessary, an easy 
editing of bad trajectories. Different MPEG4 FAEs could obviously be animated with 
the same FAP-stream allowing for an interesting comparison among their different 
renderings. 




Fig. 6. Examples of LUCIA’s single-frame emotive expressions. These were obtained by ac-
quiring real movements with ELITE, by automatically tracking and reconstructing them with
“TRACK” and by reproducing them with LUCIA.



3 Visual Emotions 

At the present time, emotional visual configurations are designed and refined by 
visual inspection of real data, using a software tool called EMOTIONAL 
PLAYER (EP) (see Figure 7), designed and implemented in Matlab© on the basis of 
FACIALPLAYER, introduced above in 2.3, and greatly inspired by the Emotion Disc 
software [18]. In the future, a strategy similar to that introduced in 2.1 will be adopted. 
EMOTIONAL PLAYER manages single facial movements of a synthetic face in a 
standard MPEG-4 framework in order to create emotional and expressive visual ren- 
derings in GRETA and LUCIA. 

As already underlined above in 2.2, in MPEG-4 animations, FDPs define the shape 
of the model while FAPs define the facial actions. The intensity and the duration of an 
emotive expression are driven by an intensity factor that is rendered by specifying an 
intensity for every FAP, and by a temporal factor which is modelled by onset, apex 
and offset parameters, as explained in [15]. 

The onset and offset represent, respectively, the time the expression takes to appear 
and to disappear; the apex corresponds to the duration for which the facial expression 
is at its peak intensity value. These parameters are fundamental to conveying the proper 
meaning of the facial expressions. In our system, every facial expression is characterised 
by a set of FAPs. These sets of FAPs allow, for example, the creation of the 6 facial 
expressions corresponding to the 6 basic primary emotions of Ekman's set (Table 2), 
chosen here for the sake of simplicity; for every expression, only 3 levels of 
intensity (low, medium, high) have been simulated. 
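
The temporal factor described above can be pictured as a trapezoidal envelope. The following Python sketch is only an illustration under stated assumptions: the linear ramps, the names `expression_envelope` and `fap_value`, and the numeric values attached to the three intensity levels are hypothetical and are not taken from EMOTIONAL PLAYER.

```python
def expression_envelope(t, onset, apex, offset, peak=1.0):
    """Intensity of an expression at time t (seconds from its start).

    onset  : time the expression takes to appear
    apex   : time the expression stays at its peak intensity
    offset : time the expression takes to disappear
    Linear ramps are an assumption made for this sketch.
    """
    if t < 0 or t > onset + apex + offset:
        return 0.0
    if t < onset:                       # rising phase
        return peak * t / onset
    if t <= onset + apex:               # plateau at peak intensity
        return peak
    # falling phase
    return peak * (onset + apex + offset - t) / offset


# Hypothetical peak values for the three simulated intensity levels.
LEVELS = {"low": 0.33, "medium": 0.66, "high": 1.0}


def fap_value(t, fap_peak, level, onset, apex, offset):
    """Scale one FAP's peak displacement by the envelope and the level."""
    return fap_peak * expression_envelope(t, onset, apex, offset, LEVELS[level])
```

For instance, with an onset of 0.2 s, an apex of 0.5 s and an offset of 0.3 s, the envelope is at half of its peak 0.1 s after the start and back at zero after 1.0 s.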

Fig. 7. EMOTIONAL PLAYER. 

Table 2. The 6 basic primary emotions of Ekman’s set with corresponding facial expressions. 



Expression   Description 

Anger        The inner eyebrows are pulled downward and together. The eyes are wide open. 
             The lips are pressed against each other or opened to expose the teeth. 

Fear         The eyebrows are raised and pulled together. The inner eyebrows are bent upward. 
             The eyes are tense and alert. 

Disgust      The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often 
             asymmetrically. 

Happiness    The eyebrows are relaxed. The mouth is open and the mouth corners pulled back 
             toward the ears. 

Sadness      The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is 
             relaxed. 

Surprise     The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. 
             The jaw is opened. 



In our system we distinguish the “emotion basis” EB(t) from the “emotion display” ED(t). 
Both are functions of the time t. An EB(t) involves a specific zone of the face, 
such as the eyebrows, mouth, jaw or eyelids. EB(t) also includes facial movements 
such as nodding, shaking or turning the head, and movements of the eyes. Each 
EB(t) is defined as a set of MPEG-4 compliant FAP parameters: 

EB(t) = {fap3 = v_1(t); ...; fap68 = v_k(t)} 

where v_1(t), ..., v_k(t) specify the FAP intensity functions created by the user. An 
EB(t) can also be defined as a combination of other emotion bases by using the '+' 
operator in this way: 

EB'(t) = EB1'(t) + EB2'(t) 

The emotion display is finally obtained by a linear scaling: 

ED'(t) = EB(t) * c = {fap3 = v_1(t) * c; ...; fap68 = v_k(t) * c} 

where EB is an emotion basis and c is a constant. The '*' operator multiplies each of the 
FAPs constituting the EB by the constant c. The onset, offset and apex (i.e. the duration 
of the expression) of the emotion are determined by the weighted sum of the functions 
v_k(t) (k = 3, ..., 68) created by mouse actions. In Figure 8, two simple emotional 
examples for fear and happiness are illustrated. 
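
The '+' and '*' operators defined above can be sketched in a few lines of Python. This is a minimal illustration, not the Matlab implementation: the helper names (`make_eb`, `combine`, `scale`, `evaluate`) and the example FAP functions are hypothetical, and summing the values of a FAP that appears in both operands of '+' is an assumption, since the text does not say how such overlaps are resolved.

```python
def make_eb(**fap_functions):
    """An emotion basis: a mapping from FAP name to a function of time."""
    return dict(fap_functions)


def combine(eb1, eb2):
    """The '+' operator. A FAP present in both bases is summed (assumption)."""
    combined = dict(eb1)
    for fap, func in eb2.items():
        if fap in combined:
            first = combined[fap]
            combined[fap] = lambda t, f=first, g=func: f(t) + g(t)
        else:
            combined[fap] = func
    return combined


def scale(eb, c):
    """The '*' operator: multiply every FAP function of the basis by c."""
    return {fap: (lambda t, f=func: f(t) * c) for fap, func in eb.items()}


def evaluate(eb, t):
    """Sample every FAP function of a basis or display at time t."""
    return {fap: func(t) for fap, func in eb.items()}


# Hypothetical bases with made-up intensity functions.
eb_a = make_eb(fap31=lambda t: 100.0 * t, fap32=lambda t: 100.0 * t)
eb_b = make_eb(fap3=lambda t: 50.0 * t)

# ED(t) = (EB_a(t) + EB_b(t)) * 0.5
display = scale(combine(eb_a, eb_b), 0.5)
```

Keeping each FAP as a function of time means that combination and scaling stay lazy: nothing is evaluated until a frame time is requested through `evaluate`.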






Fig. 8. Fear (top) and happiness (bottom) emotional examples. 



4 Concluding Remarks 

An integrated software environment designed and developed for the acquisition, crea- 
tion, management, access, and use of audio/visual (AV) articulatory data, captured by 
an automatic optotracking movement analyzer, has been introduced and described in 
its general characteristics. These methods, tools, and procedures can surely accelerate 
the development of Facial Animation Engines and in general of expressive and emo- 
tive Talking Agents. 

5 Future Trends 

Evaluation should be carried out thoroughly in the future, and evaluation tools will be 
included in this software environment. Perceptual tests comparing the original 
videos/signals with the talking head, for example, can give us some insights about where 
and how the animation engine could be improved. 

Results from a preliminary experiment for the evaluation of the adequacy of facial 
displays in the expression of some basic emotional states, based on a recognition task, 
are presented in a paper in this volume, where the potential of the evaluation 
methodology used is also discussed [19]. 



References 

1. Working Group at the Dagstuhl Seminar on Multimodality, 2001, questionnaire on Mul- 
timodality http://www.dfki.de/~wahlster/Dagstuhl Multi Modality 

2. Ferrigno G., Pedotti A., “ELITE: A Digital Dedicated Hardware System for Movement 
Analysis via Real-Time TV Signal Processing”, IEEE Trans. on Biomedical Engineering, 
BME-32, 1985, 943-950. 

3. Boersma P., “PRAAT, a system for doing phonetics by computer”, Glot International, 5 
(9/10), 1996, 341-345. 

4. Magno Caldognetto E., Cosi P., Drioli C., Tisato G., Cavicchio F., “Coproduction of 
Speech and Emotions: Visual and Acoustic Modifications of Some Phonetic Labial Targets”, 
Proc. AVSP 2003, Audio Visual Speech Processing, ISCA Workshop, St Jorioz, 
France, September 4-7, 2003, 209-214. 

5. Drioli C., Tisato G., Cosi P., Tesser F., “Emotions and Voice Quality: Experiments with 
Sinusoidal Modeling”, Proceedings of Voqual 2003, Voice Quality: Functions, Analysis 
and Synthesis, ISCA Workshop, Geneva, Switzerland, August 27-29, 2003, 127-132. 

6. Pelachaud C., Magno Caldognetto E., Zmarich C., Cosi P., “Modelling an Italian Talking 
Head”, Proc. AVSP 2001, Aalborg, Denmark, September 7-9, 2001, 72-77. 

7. Cosi P., Magno Caldognetto E., Perin G., Zmarich C., “Labial Coarticulation Modeling 
for Realistic Facial Animation”, Proc. ICMI 2002, 4th IEEE International Conference on 
Multimodal Interfaces 2002, October 14-16, 2002 Pittsburgh, PA, USA., pp. 505-510. 

8. Cosi P., Magno Caldognetto E., Tisato G., Zmarich C., “Biometric Data Collection For 
Bimodal Applications”, Proceedings of COST 275 Workshop, The Advent of Biometric 
on the Internet, November 7-8, 2002, Rome, pp. 127-130. 

9. MPEG-4 standard. Home page: http://www.chiariglione.org/mpeg/index.htm. 

10. Cosi P., Fusaro A., Tisato G., “LUCIA a New Italian Talking-Head Based on a Modified 
Cohen-Massaro’s Labial Coarticulation Model”, Proc. Eurospeech 2003, Geneva, Swit- 
zerland, September 1-4, 2003, 127-132. 

11. Cosi P., Tesser F., Gretter R., Avesani, C., “Festival Speaks Italian!”, Proc. Eurospeech 
2001, Aalborg, Denmark, September 3-7, 2001, 509-512. 

12. FACEGEN web page: http://www.facegen.com/index.htm. 

13. Cohen M., Massaro D., “Modeling Coarticulation in Synthetic Visual Speech”, in Magne- 
nat-Thalmann N., Thalmann D. (Editors), Models and Techniques in Computer Anima- 
tion, Springer Verlag, Tokyo, 1993, pp. 139-156. 

14. Perin G., “Facce parlanti: sviluppo di un modello coarticolatorio labiale per un sistema di 
sintesi bimodale”, MThesis, Univ. of Padova, Italy, 2000-1. 






15. Ekman P. and Friesen W., Facial Action Coding System, Consulting Psychologist Press 
Inc., Palo Alto (CA) (USA), 1978. 

16. Tiede, M.K., Vatikiotis-Bateson, E., Hoole, P. and Yehia, H., “Magnetometer data 
acquisition and analysis software for speech production research”, ATR Technical Report TRH 
1999, 1999, ATR Human Information Processing Labs, Japan. 

17. BTS home page: http://www.bts.it/index.php 

18. Ruttkay Zs., Noot H., ten Hagen P., “Emotion Disc and Emotion Squares: tools to explore 
the facial expression space”, Computer Graphics Forum, 22(1) 2003, 49-53. 

19. Costantini E., Pianesi F., Cosi P., “Evaluation of Synthetic Faces: Human Recognition of 
Emotional Facial Displays”, (in this volume) 



Design of a Hungarian Emotional Database for 
Speech Analysis and Synthesis 



Mark Fek1, Geza Nemeth1, Gabor Olaszy1,2, and Geza Gordos1 

1 Budapest University of Technology and Economics, 
Department of Telecommunications and Media Informatics, 
{fek,nemeth,olaszy,gordos}@tmit.bme.hu 
2 Hungarian Academy of Sciences, 
Kempelen Farkas Speech Research Laboratory, 
olaszy@nytud.hu 



Abstract. This paper reports on the ongoing work of designing and 
building a Hungarian emotional speech database. First, a brief overview 
of the related earlier work on the subject is given. Next, the targeted 
applications of our emotional speech database are described. The pro- 
blem of creating or collecting suitable prompts for different emotions 
and speaking styles is addressed. Then, we discuss the problem of collec- 
ting material for child tale reading. Finally, we present the methods of 
database validation and annotation. 



1 Research Paradigm 

We distinguish between four levels of expressive speech. They are speech ex- 
pressing basic emotions (fear, anger, happiness, sadness, etc.), speech expressing 
secondary emotions (ashamed, proud, desperate, etc.), communicative attitudes 
(helpful, reasoning, accusing, etc.), and general expressive speaking styles (e.g. 
child tale reading). 

We presume that within each of the four levels, there are distinct acoustic 
cues characterizing the different types of expressive speech. We want to explore 
them and build them into speech recognition and speech synthesis applications. 
We take an engineering point of view; thus we want to find cues which can be 
realistically built into current speech recognition and synthesis technology and 
which can lead to the development of potentially useful products. 

The usefulness of the different types of expressive speech depends either on 
the research goals or on the targeted application. Much of the current research 
on speech and emotions is concerned with basic emotions. 

Basic emotions can be useful as a test bed for expressive speech recognition 
and synthesis. There are only a few of them (researchers usually distinguish 
between 6 and 10 basic emotions [1]) and we expect them to differ more 
characteristically than the secondary emotions. 

Secondary emotions are much more numerous (some authors distinguish bet- 
ween more than 150 secondary emotions [2]) but they are more common in ever- 
yday situations, which makes them more interesting from an application oriented 
point of view. 



E. Andre et al. (Eds.): ADS 2004, LNAI 3068, pp. 113-116, 2004. 
© Springer- Verlag Berlin Heidelberg 2004 







Communicative attitudes are also numerous. Some authors distinguish bet- 
ween more than 300 different ones [2]. 

Speaking styles are a somewhat different category than the previous three. 
They are more closely related to the application level. They are characterized 
by some global acoustic tendencies. 



2 Related Earlier Work on Hungarian Expressive Speech 

The first, pioneering work on how emotions are expressed in Hungarian speech 
was carried out by Fonagy [3]. The authors used musical notation (one 
note per syllable) to describe the melody patterns of sentences expressing 82 
different types of emotions. Most of the sentences were semantically biased. The 
examined material was collected over four years. It consisted of recordings with 
actors, as well as spontaneous conversations by male, female and child speakers. 
Usually only one sentence per emotion type was examined. For each emotion, 
a textual description of the specific intonation (mainly in musical terms) and 
the voice quality is given. Some general observations were also made regarding 
speaking styles. For instance, it was noted that the ’child tale reading’ speaking 
style has a stylized, smooth, singing-like melody. The main utility of the book 
is its rich collection of different speaking styles. Some of its insights might also 
prove helpful, especially the cues differentiating between emotions. 

Aside from the aforementioned work, there was some sporadic research con- 
cerning speech and emotion. Some sentences conveying basic emotions were re- 
corded and listening tests were carried out to test whether the intended emotions 
can be identified [4]. The recordings were done using semantically neutral sen- 
tences. To facilitate the task of the reader, the semantically neutral sentences 
were inserted into mini-dialogues providing the emotional background for the 
reader. 



3 Applications 

The applications of emotional speech are either related to speech recognition or 
to speech synthesis. Speech recognition related applications require the collection 
of emotions occurring in real-life situations. Speech synthesis related applications 
can be based on databases of acted speech, which is much easier to collect. 

We have targeted three applications. The first two are related to speech 
synthesis, while the third one is related to both speech recognition and synthesis. 
The current scope of the database does not cover the requirements of the third 
application. 

The first application is computer games, where a character speaking emotionally 
depending on the situation can have high entertainment value. The 
somewhat genuine expression of some basic emotions can make the application 
interesting. 

The second application is story telling. As a first approximation, we consider 
the reading of child tales, where the dialogues and some narrative parts can be 
emotionally colored. 

Both of the above applications can be extended with a lip synchronized tal- 
king head capable of expressing the emotions visually. 

The third application is the creation of virtual agents capable of perceiving 
emotions and expressing different attitudes. The obvious task is to recognize 
whether the user becomes frustrated or angry while communicating with the 
machine. The number of attitudes to express can be limited to a few. 

4 Database Contents 

The first part of the database represents the following 12 emotions: neutrality, 
surprise, incomprehension, anger, worry, sadness, disgust, fear, frightening, 
curiosity, relief, happiness. Most of them are basic emotions. We added 
incomprehension, curiosity, and relief to have some non-basic emotions which are easy to 
produce. Frightening represents expressive speech which intends to emotionally 
influence the listener. It is interesting in its own right, and it can be useful in 
a story telling application. The recordings will be done by 4-8 male and female 
actors reading some semantically neutral sentences like ’Peter has entered the 
room’. We are embedding the semantically neutral sentences into mini-dialogues 
or mini-stories to evoke the appropriate emotions. To test whether the 
mini-dialogues are appropriate, we make recordings with university students before 
working with professional actors. To explore the interaction between semantic 
and acoustic contents, we also record semantically biased sentences for each of 
the emotions. 

The second part of the database serves the development of a child tale reading 
application. It will consist of a collection of recordings extracted from commer- 
cially available CDs. The CDs include child tales read by professional actors. To 
gain some additional data, we use an amateur speaker to make three recordings 
of some of the tales. The selected tales include a lot of emotionally colored dia- 
logues, like ’Equally strong? - the monkey asked mockingly’. We expect to have 
a large collection of expressive sentences which are easy to label and directly 
related to the targeted application. 

5 Database Validation and Annotation 

Once the recording phase is finished, we verify whether the utterances convey 
the intended emotions. The utterances failing to express the intended emotions 
will be excluded from further examinations. We will examine, however, why the 
excluded utterances failed to express the intended emotion. 

We will use two methods to validate the database. In the case of semantically 
neutral sentences, the listeners should identify the emotional content in a forced 
choice test [1], marking one from a list of n emotions. In the case of semantically 
non-neutral sentences, the result of the forced choice test would be biased by the 
semantic content of the utterance. Therefore we will use a different evaluation 
method in this case. The intended emotion will be known to the listener, who 
should evaluate the expressiveness of the utterance by giving a subjective grade. 
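
The forced choice validation described above can be scored as sketched below. This Python fragment is only an illustration: the function names, the sample judgements and the acceptance threshold are hypothetical, since the paper does not state how many correct identifications an utterance needs in order to be kept.

```python
from collections import Counter, defaultdict


def identification_rates(responses):
    """responses: iterable of (utterance_id, intended, chosen) tuples,
    one per listener judgement. Returns {utterance_id: fraction correct}."""
    correct = Counter()
    total = Counter()
    for utt, intended, chosen in responses:
        total[utt] += 1
        if chosen == intended:
            correct[utt] += 1
    return {utt: correct[utt] / total[utt] for utt in total}


def confusion_matrix(responses):
    """Count how often each intended emotion was heard as each choice."""
    matrix = defaultdict(Counter)
    for _, intended, chosen in responses:
        matrix[intended][chosen] += 1
    return matrix


def split_by_threshold(rates, threshold):
    """Separate validated utterances from those to be excluded.
    The threshold is a hypothetical parameter of this sketch."""
    kept = sorted(u for u, r in rates.items() if r >= threshold)
    excluded = sorted(u for u, r in rates.items() if r < threshold)
    return kept, excluded


# Made-up judgements from three listeners for two utterances.
responses = [
    ("u1", "anger", "anger"), ("u1", "anger", "anger"), ("u1", "anger", "disgust"),
    ("u2", "fear", "surprise"), ("u2", "fear", "fear"), ("u2", "fear", "surprise"),
]
rates = identification_rates(responses)
kept, excluded = split_by_threshold(rates, threshold=0.5)
```

The confusion matrix is useful for the follow-up question raised earlier in this section, namely why an excluded utterance failed: it shows which emotion the listeners heard instead of the intended one.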

The database will include the phonetic transcription and the prosodic 
annotation of the audio recordings. The prosodic annotation will include the pitch 
marks and sound boundaries, as well as the extracted pitch and intensity 
contours. A stylized annotation is clearly needed; however, we have not yet decided 
which method to use. We prefer a method matched to the intonation model of 
our TTS [5]. 

An utterance-level emotional annotation will be included, with the option of 
using word-level emotional labels. The labels used in the first part of the database 
are already defined. The emotional labeling of child tales will require the 
on-the-fly identification of the necessary emotional categories. We will use a flexible 
tree-structured three-level hierarchical labelling scheme, where the first level contains 
only three labels (positive, negative, neutral). The labels of the first level are 
subdivided into finer categories defining the labels of the second level. Similarly, 
the third level contains labels refined from the second level. The intensity of an 
emotion can also be indicated on a three-level scale (weak, moderate, strong). 
Additionally, a second emotional label can optionally be given to each utterance 
or word. This allows a more precise labelling of complex emotions. 
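
One possible encoding of this labelling scheme is sketched below in Python. It is an assumption-laden illustration rather than the annotation format of the database: the class names are invented, and the second- and third-level label names in the example are hypothetical, because those levels are to be identified on the fly during annotation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FIRST_LEVEL = ("positive", "negative", "neutral")
INTENSITIES = ("weak", "moderate", "strong")


@dataclass(frozen=True)
class EmotionLabel:
    path: Tuple[str, ...]              # 1 to 3 entries, coarse to fine
    intensity: Optional[str] = None    # optional three-level scale

    def __post_init__(self):
        if not 1 <= len(self.path) <= 3:
            raise ValueError("a label has one to three levels")
        if self.path[0] not in FIRST_LEVEL:
            raise ValueError("unknown first-level label")
        if self.intensity is not None and self.intensity not in INTENSITIES:
            raise ValueError("unknown intensity")


@dataclass
class Annotation:
    unit: str                          # an utterance or a single word
    primary: EmotionLabel
    secondary: Optional[EmotionLabel] = None   # for complex emotions


# Hypothetical second- and third-level names, for illustration only.
example = Annotation(
    unit="Equally strong?",
    primary=EmotionLabel(("negative", "contempt", "mockery"), "moderate"),
    secondary=EmotionLabel(("positive", "amusement")),
)
```

Storing a label as a path from the coarse to the fine level lets an annotator stop at whichever level is certain, and lets later analyses collapse all labels back to the three first-level categories.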

6 Summary 

This paper reported on the ongoing work of building a Hungarian emotional 
speech database. We have identified three different target applications and pointed 
out their different requirements. Our current work is focused on the development 
of a database of a core set of 12 emotions, and on the collection and annotation 
of child tales. The problems of validation and annotation of the database were 
also discussed. 

Abstract. We describe the emotion and dialogue aspects of the virtual 
agents used in the MRE project at USC. The models of emotion and 
dialogue started independently, though each makes crucial use of a central 
task model. In this paper we describe the task model, dialogue model, 
and emotion model, and the interactions between them. 



1 Introduction 

In the Mission Rehearsal Exercise (MRE) project at USC, we have implemented virtual 
humans [24] with a large range of functionality, who can interact with people 
and other virtual humans in a range of roles, for simulation-based training. The 
virtual humans use the core architecture inherited from the Steve agents [23], 
but add significant new capabilities, including new bodies and extended 
non-verbal behaviors, a new, more extensive spoken language dialogue capability 
[32,27,26], and emotional modelling [11,15,16]. In this paper we outline some of 
the dialogue and emotion capabilities of these agents, and describe how each 
influences the other, concluding with current integration work. 

In Section 2, we describe the team task model, used as a basic object of 
analysis by both the emotion and dialogue models. In Section 3, we summarize 
the dialogue model. In Section 4, we describe the emotion model. In Section 5, we 
describe the ways in which we have started using each of these models to inform 
the processing of the other. Finally, we conclude in Section 6 with our current 
integration work in these areas. 



2 The Virtual Human Task Model 

The ability of our agents to collaborate with human and agent teammates on 
tasks in the virtual world stems from their understanding of those tasks. Agents 
must understand task goals and how to assess whether they are satisfied, the 
actions that can achieve those goals, the selection and execution of those actions, 
and how to adapt execution to unexpected events. To provide this understan- 
ding, our agents use domain-independent reasoning algorithms operating over a 
general, declarative representation of team tasks, and this representation is used 
to encode their domain-specific task knowledge for a given training scenario (or 
class of scenarios). The task model also serves as a basic source of reasoning both 
for dialogue (talking about tasks, negotiating on tasks [28]) and for reasoning 
about emotion (how an agent feels, given its plans, goals, and beliefs). 

Our task representation extends earlier work on virtual humans for team 
training [23], augmenting this relatively standard plan representation so that 
it supports conditional, decision-theoretic and intentional reasoning. Each task 
description consists of a set of steps, each of which is either a primitive action 
(e.g., a physical or sensing action in the virtual world) or an abstract action (i.e., 
itself a task). Abstract actions give tasks a hierarchical structure. There may be 
ordering constraints among the steps, which define a partial order. 
Interdependencies among steps are represented as a set of causal links and threat relations 
[18]. Each causal link specifies that an effect of a step in the task achieves a 
particular goal that is a precondition for another step in the task (or for 
termination of the task). Threat relations specify that an effect of a step threatens a 
causal link by unachieving the goal before it is needed. 
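
The plan representation just described can be sketched as a small set of Python data classes. This is an illustration under stated assumptions, not the MRE implementation: all class, field and step names are hypothetical, literals are plain strings with a "not " prefix for negation, and the threat check ignores ordering constraints, so it reports potential threats only.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    name: str
    preconditions: List[str] = field(default_factory=list)
    effects: List[str] = field(default_factory=list)   # "p" or "not p"
    subtask: Optional["Task"] = None    # set only for abstract actions

    @property
    def is_primitive(self):
        return self.subtask is None


@dataclass
class CausalLink:
    producer: str    # step whose effect achieves the goal
    goal: str        # the achieved literal
    consumer: str    # step that needs the goal as a precondition


@dataclass
class Task:
    name: str
    steps: List[Step] = field(default_factory=list)
    ordering: List[Tuple[str, str]] = field(default_factory=list)  # (before, after)
    causal_links: List[CausalLink] = field(default_factory=list)


def negate(literal):
    return literal[4:] if literal.startswith("not ") else "not " + literal


def threats(task):
    """Return (step_name, link) pairs where a step's effect unachieves
    the goal of a causal link. Ordering is not consulted (simplification)."""
    found = []
    for link in task.causal_links:
        for step in task.steps:
            if step.name in (link.producer, link.consumer):
                continue
            if negate(link.goal) in step.effects:
                found.append((step.name, link))
    return found


# A made-up three-step task used only to exercise the structures.
link = CausalLink("secure-area", "area-secure", "land-helicopter")
task = Task(
    name="evacuate",
    steps=[
        Step("secure-area", effects=["area-secure"]),
        Step("land-helicopter", preconditions=["area-secure"]),
        Step("move-squads", effects=["not area-secure"]),
    ],
    ordering=[("secure-area", "land-helicopter")],
    causal_links=[link],
)
```

A fuller version would use the partial order to discard threats from steps that are constrained to run before the producer or after the consumer of the link.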

To support decision-theoretic reasoning, preferences over action effects are 
represented by numeric utility values that are either primitive (indicating that 
the effect has intrinsic worth for some entity) or derived (indicating that the 
effect has worth as a means towards some end, as in a subgoal), and a probability 
calculus indicates the likelihood of certain actions and effects. 

To support intentional reasoning, action effects and other state predicates 
are tagged with a belief, indicating if the virtual human has committed to a par- 
ticular truth value concerning this predicate. The probability calculus represents 
the virtual human’s measure of belief. Actions and action effects are also tagged 
with an intention, indicating the virtual human’s commitment to bringing about 
the indicated action or effect. 

Finally, to support negotiation, the representation encodes multiple, exclusive 
ways to achieve goals. These alternatives may differ in terms of their effects, 
likelihood, and utility, entities involved, etc. 

An agent’s task model represents its understanding of the task in general, 
independent of the current scenario conditions. To guide execution of the task 
and robustly handle unexpected events that require adaptive execution or re- 
planning, agents use a partial-order planning algorithm over the task model. 
Agents continually monitor the state of the virtual world via messages from the 
simulator [22] that are filtered to reflect perceptual limitations [24]. The result 
of this planning algorithm specifies how the agent privately believes that the 
team can collectively complete the task, with some causal links specifying the 
interdependencies among team members’ actions. Agents continually revise this 
private plan as the scenario unfolds. 



3 Dialogue Model 

Our agents use a rich model of dialogue that is closely linked with the task 
model, both for interpretation of utterances and for decisions about when the 
agent should speak and what to say. Our dialogue model supports multiple 
simultaneous conversations among potentially overlapping groups of interlocutors 
in a shared virtual world [32]. 

We follow the Trindi project approach to dialogue management [13]. The 
part of the context deemed relevant for dialogue modelling, termed informa- 
tion state, is maintained as a snapshot of the dialogue state. This state is then 
updated by dialogue moves, seen as abstract input and output descriptions for 
the dialogue modeling component. A complex environment such as the MRE 
situation obviously requires a fairly elaborate information state to achieve fairly 
general performance within such a domain. We try to manage this complexity 
by partitioning the information state and dialogue moves into a set of layers, 
each dealing with a coherent aspect of dialogue that is somewhat distinct from 
other aspects. 



— contact 

— attention 

— conversation 

• participants 

• turn 

• initiative 

• grounding 

• topic 

• rhetorical 

— social commitments (obligations) 

— negotiation 



Fig. 1. Multi-party, Multi-conversation Dialogue Layers 



The layers used in the current system are summarized in Figure 1. The 
contact layer [2,4,7] concerns whether and how other individuals can be accessible 
for communication. Modalities include visual, voice (shout, normal, whisper), 
and radio. The attention layer concerns the object or process that agents attend 
to [20]. Contact is a prerequisite for attention. The conversation layer models the 
separate dialogue episodes that go on during an interaction. Each conversation 
consists of a number of sub-layers, each of which may have a different 
information content for conversations happening at the same time. The participants 
may be active speakers, addressees, or overhearers [4]. The turn indicates the 
(active) participant with the right to communicate (using the primary channel) 
[20,31]. The initiative indicates the participant who is controlling the direction 
of the conversation [33]. The grounding component of a conversation tracks how 
information is added to the common ground of the participants [29]. The 
conversation structure also includes a topic that governs relevance, and rhetorical 
connections between individual content units. Once material is grounded, even 
as it still relates to the topic and rhetorical structure of an ongoing 
conversation, it is also added to the social fabric linking agents, which is not part of any 
individual conversation. This includes social commitments: both obligations 
to act or restrictions on action, as well as commitments to factual information 







[30,17]. There is also a negotiation layer, modeling how agents come to agree on 
these commitments [3,25]. More details on these layers, with a focus on how the 
acts can be realized using verbal and non-verbal means, can be found in [32]. 
We focus here on the level of social commitments, as this has the most direct 
connection to the task model and the emotion model. 

3.1 Obligations and Social Commitments 

Core speech acts have functions related to influencing the topic under discussion 
and establishing and resolving the commitments and obligations of speakers and 
other conversational participants towards states and actions. Core speech acts 
have a content which is either a state, an action description or a question about 
one of these. 

Each of the states and actions in the task model is annotated with semantic 
information that can be used to describe, and to recognize descriptions of, those 
states and actions in natural language (and in our speech-act based agent 
communication language). For example, the action of the sergeant securing the assembly 
area (which can be accomplished by having the squad leaders each secure a quadrant) 
is represented as shown in (1). The resulting state of the assembly area being secure is 
represented as shown in (2). 

(1)  agent    sgt 
     event    secure 
     patient  assembly-area 
     type     act 

(2)  object-id  assembly-area 
     attribute  safety 
     value      secure 
     polarity   positive 
     type       state 



Speech recognition and natural language interpretation produce similar 
contents from spoken utterances. Dialogue processing then compares the NL 
representation to the relevant task model representations and, if a sufficiently close 
match can be found with a task model state or action, that state or action is taken 
as the referent. 
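
The matching step can be pictured with the two representations (1) and (2) encoded as Python dictionaries. The scoring below is purely an assumption made for illustration: the text does not say how closeness is measured, so the feature-overlap score, the threshold and the names `match_score`, `find_referent` and `TASK_MODEL` are all hypothetical.

```python
# Representations (1) and (2) from the text, keyed by invented names.
TASK_MODEL = {
    "secure-act": {
        "agent": "sgt", "event": "secure",
        "patient": "assembly-area", "type": "act",
    },
    "secure-state": {
        "object-id": "assembly-area", "attribute": "safety",
        "value": "secure", "polarity": "positive", "type": "state",
    },
}


def match_score(nl_repr, task_repr):
    """Fraction of the NL features that the task entry shares.
    An entry that contradicts any feature value scores zero."""
    if not nl_repr:
        return 0.0
    shared = 0
    for feature, value in nl_repr.items():
        if feature in task_repr:
            if task_repr[feature] != value:
                return 0.0
            shared += 1
    return shared / len(nl_repr)


def find_referent(nl_repr, task_model, threshold=0.75):
    """Return the best-matching entry name, or None if no match is
    sufficiently close. The threshold is a hypothetical parameter."""
    best_name, best_score = None, 0.0
    for name, task_repr in task_model.items():
        score = match_score(nl_repr, task_repr)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# An utterance such as "secure the assembly area" might be interpreted
# without an explicit agent; it still resolves to the action in (1).
partial = {"event": "secure", "patient": "assembly-area", "type": "act"}
referent = find_referent(partial, TASK_MODEL)
```

Returning `None` when nothing clears the threshold leaves room for the dialogue manager to treat the utterance as not understood rather than forcing a referent.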

The core speech acts that are currently modelled include assert, 
info-request, order, request and suggest. Unlike in many accounts of the effects 
of these speech acts (e.g. [6,1,5,9]), there are no direct effects on the beliefs, 
desires or intentions of the conversational participants. This allows for the 
possibility that participants are insincere in their utterances. Following [30], the direct 
effects involve social commitments, and one may then infer from these 
commitments the beliefs or intentions commonly associated with these utterance types, 
given additional assumptions. 

Assertions will have the effect of establishing a commitment by the 
speaker that the state holds, or that the action happened, is happening, will happen, 
or should happen, depending on the tense and aspect of the utterance. 
Info-requests have a question as their contents. Questions are (possibly partial) 
propositions together with a designated q-slot indicating the part of the 
proposition asked about. For example, (3) shows an info-request by the LT to the Sgt 
with the content being a question about whether the assembly area is secure. 
Info-requests have as their effect an obligation to address the question. Requests 
have an action as content, and the effect is an obligation to address the request, 
e.g., to consider and give feedback on the request. Orders, which can only be 
performed by a superior to a subordinate in the social structure, have as their 
effect an obligation to perform the action that is their content. Suggestions do 
not impose obligations, but do focus the topic on the action. 

(3)  action     info-req 
     actor      lt 
     addressee  sgt 
     type       csa 
     content    q-slot  polarity 
                type    question 
                prop    object-id  assembly-area 
                        attribute  safety 
                        value      secure 
                        time       present 
                        type       state 
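
The direct effects of the five core speech acts, as described above, can be summarized in a short Python sketch. It is a simplified illustration and not the agents' dialogue manager: the `SocialState` record, the obligation kinds and the function name are hypothetical, and contents are plain strings rather than the structures shown in (1) to (3).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SocialState:
    # (speaker, content): the speaker is committed to the content
    commitments: List[Tuple[str, str]] = field(default_factory=list)
    # (agent, kind, content): the agent is obliged to do something
    obligations: List[Tuple[str, str, str]] = field(default_factory=list)
    topic: str = ""


def apply_core_speech_act(state, act, actor, addressee, content,
                          actor_is_superior=False):
    """Update the social state only. Beliefs, desires and intentions are
    deliberately left untouched, since speakers may be insincere."""
    if act == "assert":
        state.commitments.append((actor, content))
    elif act == "info-request":
        state.obligations.append((addressee, "address-question", content))
    elif act == "request":
        state.obligations.append((addressee, "address-request", content))
    elif act == "order":
        if not actor_is_superior:
            raise ValueError("only a superior can order a subordinate")
        state.obligations.append((addressee, "perform", content))
    elif act == "suggest":
        state.topic = content          # no obligation is imposed
    else:
        raise ValueError(f"unknown core speech act: {act}")
    return state


# The info-request of (3): the LT asks the Sgt whether the area is secure.
state = apply_core_speech_act(SocialState(), "info-request", "lt", "sgt",
                              "assembly-area secure?")
```

Keeping commitments and obligations in a structure separate from any belief store mirrors the point made in the text: mental attitudes may be inferred from the social state, but they are not its direct effects.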



In addition to these forward-looking acts [8], there are also backward-looking 
acts that point back toward previous dialogue acts or aspects of conversational 
structure. These will tend to relieve obligations, e.g., by performing obliged 
actions or addressing other utterances. These include acceptances of requests 
(which will create an obligation to perform the requested act itself) as well as 
rejections and other moves that will not. 

3.2 Dialogue Processing 

Language processing occurs in two distinct and interleavable “cycles”, one for 
understanding language and updating the information state, and a second for 
producing language. This separation of input and output processing cycles allows 
the agent to have an arbitrary interleaving of contributions by itself and others 
rather than enforcing a rigid turn-alternation. Each communicative contribution 
is simultaneously interpreted at each layer, and may correspond to a number of 
acts at different layers. Generation usually starts from an intention to perform 
a main act; however, any realized utterance will also correspond to a number of 
acts, some of which (e.g., turn-taking) may be as much a result of the timing of 
the performance with respect to other events as of the planned behavior. 

4 Emotion Model 

The computational model of emotion in our virtual humans is called EMA 
(EMotion and Adaptation) [12]. Like many computational models of emotion, 
EMA is informed by the group of psychological theories of emotion 
collectively termed appraisal theory [14]. Unlike most computational models, however, 
EMA provides a deep process model of the mechanisms underlying emotion, 
including the cognitive assessments that precede emotion, their consequences on 
cognition and behavior, and the way these consequences impact subsequent 
assessments. In terms of the antecedents of emotion, it broadens the goal-based 
reasoning underlying prior process models to cover other cognitive assessments 
implicated by appraisal theories. It is also the first process model of coping, a 
process associated with emotion in appraisal theories, and one which is 
implicated in emotional biases on cognition, but typically neglected by computational 
models. By modeling these deep processes explicitly, we are able to facilitate a 
tight integration of emotion with dialogue processing. 

Appraisal theories argue that emotion arises from a person’s assessment of
their relationship with their environment, including not only their current
condition but also past events that led to this state as well as future prospects.
Appraisal theory argues that people possess many distributed processes for
interpreting this relationship (e.g., planning, explanation, perception, memory,
linguistic processes) but that appraisal maps characteristics of these disparate
processes into a common set of intermediate terms called appraisal variables.
These variables characterize the significance of events from the individual’s
perspective. Events do not have significance in and of themselves, but only by
virtue of their interpretation in the context of an individual’s beliefs, desires
and intentions, and past events. For example, the outcome of the latest
presidential election might inspire joy, anger or indifference, depending on which
candidate one desires, and one’s anger towards an individual may be mitigated by
whether one believes they intended an offensive act. As in many appraisal models,
events are characterized in terms of a number of appraisal variables:

— Perspective: from whose viewpoint is the event judged 

— Desirability: what is the utility of the event if it comes to pass, from the 
perspective taken (e.g., does it causally advance or inhibit a state of some 
utility) 

— Likelihood: how probable is the outcome of the event 

— Causal attribution: who deserves credit or blame 

— Temporal status: is this past, present, or future 

— Controllability: can the outcome be altered by actions under control of the 
agent whose perspective is taken 

— Changeability: can the outcome be altered by some other causal agent 

EMA differs from many models by modeling the processes that derive these
assessments, and in particular, how subsequent coping responses can bias these
assessments. We treat appraisal as a set of feature detectors that characterize the
current state of the agent’s mental processes. For example, an event is interpreted
through the lens of the agent’s current plans to assess its potential to impact
current goals. Each appraised event is mapped into an emotion instance of some
type and intensity, following the scheme proposed by Ortony et al. [21].
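
The appraisal variables listed above, and the mapping from an appraised event to an emotion instance, can be sketched roughly as follows. This is only an illustration: the class and field names, the value scales, and the rule "intensity = |desirability| x likelihood" are assumptions made for the example, not EMA's actual rule set.

```python
from dataclasses import dataclass


@dataclass
class AppraisalFrame:
    """Appraisal variables for one event (field names are illustrative)."""
    perspective: str       # whose viewpoint the event is judged from
    desirability: float    # utility of the event, assumed scale [-1.0, 1.0]
    likelihood: float      # probability of the outcome, in [0.0, 1.0]
    causal_agent: str      # who deserves credit or blame
    temporal_status: str   # "past", "present", or "future"
    controllable: bool     # can the perspective-taking agent alter the outcome?
    changeable: bool       # can some other causal agent alter the outcome?


@dataclass
class EmotionInstance:
    kind: str
    intensity: float


def appraise(frame: AppraisalFrame) -> EmotionInstance:
    """Map a frame to an emotion type and intensity (assumed, simplified rule)."""
    certain = frame.likelihood >= 1.0
    if frame.desirability >= 0:
        kind = "joy" if certain else "hope"
    else:
        kind = "distress" if certain else "fear"
    # Assumed intensity rule: magnitude of utility weighted by probability.
    return EmotionInstance(kind, abs(frame.desirability) * frame.likelihood)
```

Under these assumptions, an uncertain undesirable outcome yields fear, and the same outcome once certain yields distress.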

EMA supports multiple appraisals of the same event and can simultaneously
appraise multiple events, forcing the issue of how the agent focuses on some
aspect of its relationship with the environment. An activation-based focus of
attention model computes a current emotional state based on the most recently
accessed emotion instances. So, for example, the act of talking about an event can
bring associated emotions into focus.

Coping determines how one responds to the appraised significance of events. 
Emotions can motivate people to act on the world (problem-focused coping) but 
they can also motivate us to re-interpret our circumstances, thereby influencing 
our prior beliefs, desires and expectations (emotion-focused coping). In EMA, 
coping strategies essentially work in the reverse direction of appraisal, identifying 
the precursors of emotion that should be maintained or altered (e.g., beliefs, 
desires, intentions, plans, expectations, etc.). Strategies include: 

— Action: select an action for execution 

— Planning: form an intention to perform some act (the planner uses intentions 
to drive its plan generation) 

— Seek instrumental support: ask someone that is in control of an outcome for 
help 

— Procrastination: wait for an external event to change the current
circumstances

— Positive reinterpretation: increase utility of positive side-effect of an act with 
a negative outcome 

— Acceptance: drop a threatened intention 

— Denial: lower the probability of a pending undesirable outcome 

— Mental disengagement: lower utility of desired state 

— Shift blame: shift responsibility for an action toward some other agent 

— Seek/suppress information: form a positive or negative intention to monitor 
some pending or unknown state 

Strategies change behavior, but they also change the agent’s interpretation
of its circumstances, leading to re-appraisal. For example, simply intending to
perform an act can improve the agent’s appraised sense of control and generate
positive emotions. In terms of behavior, coping strategies provide the input to the
behavioral, task and language processes that actually execute these directives.
For example, plan-related coping will generate an intention to perform some
action that will make an undesirable situation better, which in turn leads
the planning system to generate and execute a valid plan to accomplish this
act. Alternatively, coping strategies might abandon the goal, lower the goal’s
importance, or reassess who is to blame. This close connection between appraisal,
coping and cognition provides the process “hooks” that facilitate the influences
between emotion and dialogue.
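
To make the idea of coping "working in the reverse direction of appraisal" more concrete, the following sketch treats three of the strategies listed above as small transformations of the quantities that appraisal reads. All names, the scaling factor of 0.5, and the intensity rule are assumptions chosen for illustration and are not taken from EMA itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Appraisal:
    desirability: float   # utility of the outcome (negative = undesirable)
    likelihood: float     # probability that the outcome occurs
    blamed_agent: str     # agent currently held responsible


def denial(a: Appraisal, factor: float = 0.5) -> Appraisal:
    """Denial: lower the probability of a pending undesirable outcome."""
    return replace(a, likelihood=a.likelihood * factor)


def mental_disengagement(a: Appraisal, factor: float = 0.5) -> Appraisal:
    """Mental disengagement: lower the utility attached to the state."""
    return replace(a, desirability=a.desirability * factor)


def shift_blame(a: Appraisal, other: str) -> Appraisal:
    """Shift blame: move responsibility toward some other agent."""
    return replace(a, blamed_agent=other)


def intensity(a: Appraisal) -> float:
    """Assumed re-appraisal rule: |utility| weighted by probability."""
    return abs(a.desirability) * a.likelihood
```

Each strategy returns a new appraisal rather than mutating the old one, so the effect of coping on re-appraisal can be compared before and after.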
5 Emotion and Dialogue: Reinforcing Each Other

There are several aspects of dialogue interaction that have already begun to 
make use of the emotion model described in section 4. These include: reference 
resolution, initiative, content selection, and content realization. 

Reference resolution and providing appropriate answers for general questions
can be fairly difficult. For example, if a conversation is started with a question
like “What happened here?”, there can be many true answers indicating events
that occurred in the present location. Some of these may be filtered based on
common ground or assumed knowledge of the hearer, but otherwise it may be
difficult to prefer one to another and provide natural answers without a lot
of fairly involved commonsense reasoning. Using the “concerns” of the agent,
calculated by the emotion reasoning, the agent can report on the event that is
causing it the strongest emotion. While this may not be ideal in a Gricean
cooperative sense, it does often mimic human behavior.
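
The selection heuristic just described can be sketched in a few lines. The data structure and function name below are hypothetical; the sketch only shows filtering by common ground and then preferring the event with the strongest associated emotion.

```python
from typing import Iterable, NamedTuple, Optional


class AppraisedEvent(NamedTuple):
    description: str
    emotion: str
    intensity: float
    in_common_ground: bool   # True if the hearer is assumed to know it already


def answer_what_happened(events: Iterable[AppraisedEvent]) -> Optional[str]:
    """Pick the not-yet-shared event that causes the strongest emotion."""
    candidates = [e for e in events if not e.in_common_ground]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.intensity).description
```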

Likewise, as part of the coping mechanisms, the emotion module can signal to
the dialogue manager that there is an important issue to discuss, even when not
prompted by a specific question. In this case, the agent can take the initiative,
when appropriate, and bring up the new topic.

Emotion reasoning is also used in the agent’s language generation procedures.
First, the concerns and coping strategies provide information that can be
used to assess a suggestion or order, leading to a choice of negotiation move [28].
Next, emotion reasoning can signal which aspects of a situation to focus on. For
example, considering a head-on crash between two vehicles, one could focus on
the event itself, or on each of the contributing sub-actions of the individual
vehicles hitting the other. Finally, emotion reasoning is used to select the words
that are used to describe the situation, trying to find the closest match in affect
between the words used and the desired coping strategy. For example, as part of a
coping strategy of shifting blame, the agent can inform the content realization to
bias the way it phrases dialogue [10]. Instead of neutrally describing a
car accident, “We collided”, a shift-blame strategy could bias the phrasing to be
“They rammed into us.”
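
A minimal sketch of this kind of affect-sensitive word choice is given below, using only the two phrasings from the example above. The table layout, the bias labels, and the function name are hypothetical and stand in for whatever lexical resources the real generator uses.

```python
from typing import Optional

# Hypothetical lexicon: each event has candidate phrasings tagged with a bias.
PHRASINGS = {
    "collision": [
        ("We collided.", "neutral"),
        ("They rammed into us.", "blame_other"),
    ],
}

# Hypothetical link between a coping strategy and the bias it asks for.
STRATEGY_TO_BIAS = {"shift_blame": "blame_other"}


def realize(event: str, coping_strategy: Optional[str] = None) -> str:
    """Return the phrasing whose bias best matches the coping strategy."""
    wanted = STRATEGY_TO_BIAS.get(coping_strategy, "neutral")
    options = PHRASINGS[event]
    for text, bias in options:
        if bias == wanted:
            return text
    return options[0][0]   # fall back to the first listed phrasing
```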

The dialogue model also influences the emotion reasoning. In general,
dialogue brings events into focus, which in turn brings into focus the associated
appraisals, altering emotional state and potentially invoking a coping response.
Moreover, how the agent copes with a stressful event depends on the dialogue
context. Recall that EMA can appraise events from different perspectives. So, if a
superior asks a subordinate agent about an event, that not only brings into focus
the event and the agent’s appraisals of it but also how the subordinate agent feels
the superior might feel about the event. Coping strategy selection takes this
into account and therefore the subordinate agent is, for example, more likely to
shift blame in response to a question of “What happened here?” if the event in
question is something that the superior is unhappy about.
6 Current and Future Integration Plans

There are several further areas in which we are starting to integrate the
emotion and dialogue reasoning more closely to produce better affective dialogue.
First, we have implemented a system for agents to talk about their emotional state,
using the same dialogue model used for talking about tasks [19]. The basic mechanism
of describing emotions can also be used for other coping strategies, intended
to induce particular effects other than just expressing a true inner state. For
example, an agent can evoke sympathy by appearing distressed, or can perhaps
cast blame aside by acting angry at another. Also, the coping strategies can be
used to influence the type of dialogue behavior, e.g., when initiative is taken, or
how much attention to pay to the current focus versus possible other
interpretations. Likewise, more of the dialogue state can be used for emotional
assessment. For instance, some agents may feel distress over unaccomplished
obligations or cases in which their actual beliefs differ from their commitments.
Abstract. In this paper we describe how to generate affective dialogs for
multiple virtual characters based on a combination of both automatically
generated and pre-scripted scenes. This is done by using the same technique for
emotion elicitation and computation, which takes input either from the human
author in the form of appraisal and dialog act tags or from a dialog planner in
the form of inferred emotion-eliciting conditions. In either case, the system
computes the resulting emotions and their intensities. Emotions are used to
inform the selection of pre-scripted scenes and dialog strategies, and their
surface realization. The approach has been integrated in two fully operable
systems, the CrossTalk II installation and the NECA eShowroom.



1 Introduction 

Over the last few years our group has explored the simulation of conversations among
animated agents as a new style of presenting information to a user. A shift from settings
with single presentation agents towards the use of presentation teams has a number
of advantages: teams enrich the repertoire of modalities to convey information and
they can serve as a rhetorical device to reinforce beliefs. In a first system, the
so-called Inhabited Market Place [1], we focused on presenting information about a
product, e.g., a certain car, in the form of a simulated dialog between a salesperson (S)
and one or more customers (C1, C2, C3, ...). In essence, this system generated
query-answer dialogs with queries referring either directly to product attributes or to value
dimensions that are associated with the type of product under discussion. In the car
domain the set of value dimensions includes prestige, safety, sportiness,
environmental friendliness, etc. Typically, the role of virtual customers is to pose
questions, whereas the virtual salesperson provides answers on the basis of available
product information, e.g., product features stored in a database. In addition to asking
questions or answering them, the virtual characters could also comment on a
preceding turn of another character. Fig. 1 shows an excerpt of a generated car-sales
dialog.

Regarding dialog generation, the system has been implemented in two different
variants. One variant deploys a single dialog planner that, somewhat similar to a
playwright of a classical theatre performance, determines the course of the whole
conversation, i.e. all contributions of the involved characters together with the
temporal order of who is going to speak next. In contrast to this, a second version of
the system adopts the multi-agent paradigm and equips each of the involved
characters with its own dialog planner [2]. While the multi-agent version is somewhat
more challenging regarding turn-taking assignment and maintaining coherency of the
conversation, it is more flexible as to the number of conversational partners and
seems more suitable if the user herself takes part in the conversation as well. In
either case, the generated dialogs are mostly task-oriented in the sense that the
characters only talk about the presence or absence of certain product features, and
the impact of features on associated value dimensions, in order to fulfill the
underlying product information task.

In order to make the simulated conversations more interesting for human
observers, we wanted to insert sub-dialogs which would be perhaps off-topic but
unexpected and funny. Unfortunately, the generation of such intermezzi is difficult to
automate, especially when compared to those that have been manually scripted by
creative content authors. Therefore, we decided to develop a platform that enables us to
coherently interweave manually scripted sub-dialogs with dialogs that are automatically
generated. This approach has been successfully applied for the realization of
CrossTalk, an interactive installation designed for public spaces, such as an
exhibition or a trade fair [3,4].

The CrossTalk installation provides visitors with a spatially extended interaction
experience by offering two virtual spaces on separate screens, one displaying
Cyberella, the installation’s hostess, the other displaying Tina and Ritchie, two virtual
actors “hired” to perform car sales dialogs to inform visitors about a certain car. In
addition, Tina and Ritchie can converse with Cyberella, thus creating the illusion that
the agents have cross-screen conversations. This can be considered a playful
illustration of the “computers as theatre” paradigm as introduced by [5]. Moreover,
the installation relies on what we call a meta-theater metaphor. Quite similar to
professional actors, characters in CrossTalk are not always on duty. Rather, they can
step out of their roles, and amuse the user with unexpected intermezzi and rehearsal
periods. For instance, when no visitor is present the installation is in OFF mode. To
keep the actors alive for the purpose of attracting new passers-by, we emulate
small-talk conversations among stand staff members. We rely on a repertoire of pre-scripted
small-talk scenes from which the system chooses in OFF mode. To demonstrate
CrossTalk at the CeBIT 2002 fair, a large corpus of pre-scripted scenes
(more than 220 each for English and German) was assembled by one of our
colleagues with experience in theater acting and directing. Some episodes cover
themes related to everyday concerns, such as what to do in the evening, or
wondering about career prospects. Other scenes refer to the world of the theater or
movies.

S:   What can I do for you?
C1:  We are interested in that car.
C2:  How fast can it drive?
S:   It drives up to 225 km/h.
C2:  That’s fast.
C3:  This is bad for the environment!
S:   Bad for the environment? This car is made of recyclable materials.
     Besides, it has a catalytic converter.
C1:  How much does it cost?

Fig. 1. Portion of a simulated car-sales conversation

Fig. 2 provides an impression of a pre-scripted scene. In this case, the literal
informational content is not really of interest to a third-party observer while the
reflection of affective state and interpersonal relationships among the characters may
be. In fact, informal feedback collected from visitors who interacted with the
installation and watched performances of the characters revealed that almost
all of them found the scripted episodes more amusing and interesting to watch
than the automatically generated car-sales performances.

This observation motivates the aim to color simulated dialogs through the
expression of emotions in order to create a lively performance. To this end, we first
need to extend the cognitive model of our characters by allowing them to have
emotions. Secondly, we need to simulate changes in emotions as consequences of
executed dialog acts. Thirdly, we need to refine dialog generation so that emotions are
taken into account and exploited in the generation process.



Cy.:      What are you gonna do after this?
Tina:     Got a job at Walmart.
Cy.:      Online sales?
Tina:     Something like that.
Ritchie:  You're kidding. You're gonna be a sales puppet?

Fig. 2. Pre-scripted scene



2 Related Research 

Virtual characters and especially embodied conversational characters are now widely
used in various application areas including virtual training environments [6], portable
personal guides [7], interactive fiction [8] and storytelling systems [9], as well as
e-commerce applications [10], and in interfaces of consumer electronics [11].

Many conversational characters have been developed for applications assuming 
that the character engages in a face-to-face conversation with the user. In order to 
increase the believability of the virtual conversation partner, researchers have begun 
to address the modeling and emulation of human-like qualities such as personality and 
affective behavior. Examples of emotional characters include COSMO [12], Emile 
[13], Peedy [14], and the Greta agent [15]. In these systems emotion modeling has 
been inspired by the so-called OCC model developed by Ortony, Clore, and Collins 
[16], although the approaches differ in the granularity of modeling, the mathematical
machinery for computing emotions, and the way the model has been
implemented on a technical level.

Systems with multiple characters have also been proposed by others. Earlier
approaches, such as Gilbert & George [17], Mr. Bengo [18], and [19], however,
do not explicitly model affective states in their characters. In contrast to these
approaches, a few research groups have started to address emotion modeling in
multi-character scenarios. In the context of a military mission rehearsal application,
Traum and Rickel [20] address dialog management comprising human-character and
character-character dialogs in immersive virtual environments. Prendinger et al. [21]
developed a framework for scripting presentations with multiple affective characters
in a web-based environment. Part of their work is the SCREAM system, which computes
affective states based on the OCC model but also considers aspects of the social
context, i.e., role and status of the characters. The simulation of group-dynamic
phenomena in multi-character scenarios has been addressed by [22] and [23]. Both
approaches are based on socio-psychological theories: Guye-Vuillieme et al. aim at
the generation of believable displays of non-verbal interactions, while Rist and
Schmitt simulate attitude changes of virtual characters in multi-party negotiation
dialogs. Finally, the work by [24] is of high relevance for our work since it addresses
the variation in linguistic style with regard to some socially determined variables.
However, their generation approach does not distinguish between emotion categories
but represents a character’s affective state just by a single abstract parameter called
emotional disposition.



3 Outline of Approach 

As stated in the introduction, our aim is to improve the quality of simulated 
conversations among virtual characters by modeling their affective states in order to 
exploit such states as an additional resource for the generation of dialog contributions, 
the articulation of verbal utterances, and the generation of non-verbal expressions. 



3.1 Modeling Emotions and Personality Traits 



To give our characters the ability to have emotions, we follow the lines of the OCC
model [16]. OCC is a cognitive model of emotions, and is essentially based on the
concepts of appraisal and intensity. The individual is said to make a cognitive
appraisal of the current state of the world. Emotions are defined as valenced
reactions to events of concern to us, actions of those we consider responsible for
such actions, and objects/persons. The OCC theory defines 22 emotion types. For the
purpose of the current paper, we concentrate on those listed in Table 1.

Table 1. Examples of OCC-Emotions

Group           Description                    Emotion Type and Name
--------------  -----------------------------  --------------------------------------------
Well-being      Appraisal of a situation as    Joy: an event is desirable for self.
                an event.                      Distress: an event is undesirable for self.
Prospect-based  Appraisal of a situation as    Hope: a prospective event is desirable.
                a prospective event.           Fear: a prospective event is undesirable.
Attribution     Appraisal of a situation as    Pride: approving of one’s own action.
                an accountable action of       Admiration: approving of another’s action.
                some agent.                    Shame: disapproving of one’s own action.
                                               Reproach: disapproving of another’s action.
Attraction      Appraisal of a situation as    Liking: finding an object appealing.
                containing an attractive or    Disliking: finding an object unappealing.
                unattractive object.

Emotion intensities and their decay differ across personalities. Similar to other
approaches, we rely on so-called personality dimensions (or traits) to describe the
personality of our characters. In particular, we consider the personality traits proposed
in the Five Factor Model of personality [25]: openness, conscientiousness,
neuroticism, extraversion, and agreeableness. We represent a character’s conformance
with a certain trait as an adjustment on a scale between -1.0 (not at all conformant) and
+1.0 (fully conformant). This way, the personality profile of a character is given by
the set of adjustments for all traits. Depending on the degree to which a character
conforms with a certain trait, both the intensity of emotional reactions and the decay of
emotion intensities are affected. For instance, we model our characters to feel a joy
emotion more intensely the more extravert and agreeable they are. We express
dependencies of this kind by biasing the emotions with a baseline intensity according
to the settings of personality traits, e.g. an extravert character’s baseline intensity for
joy is 0.15, whereas an introvert character’s baseline intensity for joy would be 0.0.
We use the openness and conscientiousness traits to influence the emotion intensities
in the attribution and attraction groups. For instance, a conscientious character
appraises the result of an irresponsible behavior as more blameworthy than a less
conscientious character.
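
A personality profile of this kind, together with a trait-dependent baseline, might be represented as in the following sketch. Only the trait names, the [-1.0, +1.0] scale, and the two quoted joy baselines (0.15 and 0.0) come from the text; the linear scaling in between and the restriction to the extraversion trait are assumptions made to keep the example small.

```python
from dataclasses import dataclass

TRAITS = ("openness", "conscientiousness", "neuroticism",
          "extraversion", "agreeableness")

MAX_JOY_BASELINE = 0.15   # value quoted in the text for an extravert character


@dataclass
class Personality:
    openness: float = 0.0
    conscientiousness: float = 0.0
    neuroticism: float = 0.0
    extraversion: float = 0.0
    agreeableness: float = 0.0

    def __post_init__(self) -> None:
        # Every trait adjustment must lie on the scale from -1.0 to +1.0.
        for name in TRAITS:
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [-1.0, 1.0], got {value}")


def joy_baseline(p: Personality) -> float:
    """Assumed linear rule: 0.0 for introverts, up to 0.15 for full extraverts."""
    return MAX_JOY_BASELINE * max(0.0, p.extraversion)
```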

Fig. 3 shows a graphical interface that can be used to specify a character’s
personality traits (bottom right) and to choose among several emotion decay
functions (top right), together with an emotion monitor that traces intensity
development and decay of emotions during a simulated conversation (left). A
check mark on the right-hand side (see hope) signals that this emotion currently
impacts the character’s non-verbal and verbal behavior. Furthermore, the elicitors
of an emotion are displayed in a hover frame. For testing and redesign purposes,
this interface allows a character’s emotions to be manipulated at run time (second
slider to the right of each emotion).

3.2 Emotion Elicitation and Change of Affective States 

A central aspect of the OCC model is that it associates each emotion type with a set of
emotion eliciting conditions (EECs). We represent EECs as a set of variables that
characterize emotion-inducing situations. We currently use the following set of
variables: Desirability of events, Praiseworthiness of actions, Appealingness of
objects, Liking reflecting how attracted to another person one is, Likelihood reflecting
the degree of belief that an anticipated event will occur, and Realization reflecting the
degree of belief that an anticipated event has occurred. EEC variables take on values
from the range [-1.0, 1.0]. For example, in the case of Desirability the value -1.0
means very undesirable (induces distress) whereas 1.0 stands for very desirable
(induces joy). The signs of the values determine which emotion types are elicited (e.g.
distress or joy) and the absolute values determine their intensities. The outcome of the
subjective appraisal is a list of emotions: (e1, ..., en).

Fig. 3. Emotion monitor and configuration panels.
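
The sign-and-magnitude rule for EEC variables can be illustrated with a small function covering three of the variables and the emotion types of Table 1. This is a simplified sketch: the function name and the `agent_is_self` flag are hypothetical, and the prospect-based variables (Likelihood, Realization) are deliberately left out because the text does not spell out how they combine.

```python
from typing import List, Tuple


def elicit(desirability: float = 0.0,
           praiseworthiness: float = 0.0,
           appealingness: float = 0.0,
           agent_is_self: bool = False) -> List[Tuple[str, float]]:
    """Return (emotion, intensity) pairs: sign picks the type, |value| the intensity."""
    for value in (desirability, praiseworthiness, appealingness):
        if not -1.0 <= value <= 1.0:
            raise ValueError("EEC values must lie in [-1.0, 1.0]")

    emotions = []
    if desirability:
        name = "joy" if desirability > 0 else "distress"
        emotions.append((name, abs(desirability)))
    if praiseworthiness:
        if agent_is_self:
            name = "pride" if praiseworthiness > 0 else "shame"
        else:
            name = "admiration" if praiseworthiness > 0 else "reproach"
        emotions.append((name, abs(praiseworthiness)))
    if appealingness:
        name = "liking" if appealingness > 0 else "disliking"
        emotions.append((name, abs(appealingness)))
    return emotions
```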

Applying OCC to compute emotions in dialogs among virtual characters starts by
identifying what kind of events, actions and objects can occur and how to map them
to EECs. Our characters live in a virtual (2D or 3D) environment and have
limited perceptual capabilities. In the CrossTalk installation, they are only
able to recognize a limited number of events, such as user feedback via touch
screen, simulated system failures, or the arrival and departure of visitors
detected by a web cam. Since we focus on group conversations, we can circumvent this
limitation. Our characters learn about events, actions, and objects by listening to the
dialog contributions of the other conversational partners. We hereby assume that the
internalization of events, actions or objects is obtained either through a direct
perceptual process or results from interpreting dialog acts. In either case, we associate
events, actions or objects with EECs. Since we deal with multi-character scenarios,
we need to model that different characters may perceive the same event, action or
object differently, and in case of overhearing dialog contributions (referring to events,
actions or objects) they may develop different emotional responses. Therefore, events,
actions or objects must be mapped to EECs separately for each character. In sum, our
model comprises several mapping steps as illustrated in Fig. 4.
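
The requirement that the same fact be mapped to EECs separately for each character can be sketched as follows. The attitude and impact tables, their numeric values, and the weighted-sum rule are all hypothetical; they are chosen only so that two customers react differently to the same statement about top speed, as in Fig. 1.

```python
# Hypothetical per-character attitudes toward value dimensions, in [-1.0, 1.0].
ATTITUDES = {
    "C2": {"sportiness": 0.75, "environment": 0.0},
    "C3": {"sportiness": -0.5, "environment": 1.0},
}

# Hypothetical impact of a product fact on each value dimension.
IMPACT = {
    "top_speed_225_kmh": {"sportiness": 1.0, "environment": -0.5},
}


def desirability_for(character: str, fact: str) -> float:
    """Character-specific Desirability EEC for a fact (assumed weighted sum)."""
    attitudes = ATTITUDES[character]
    total = sum(attitudes.get(dim, 0.0) * weight
                for dim, weight in IMPACT[fact].items())
    return max(-1.0, min(1.0, total))   # clamp to the EEC range
```

With these made-up numbers, C2 appraises the fact as desirable and C3 as undesirable, so the sign rule would elicit joy for one and distress for the other.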




4 Generating Affective Car-Sales Dialogs - The NECA 
eShowroom 

The NECA eShowroom (www.eshowroom.org) is a more elaborate version of the
Inhabited Market Place [1] introduced in Section 1. Based on user input the system
generates an animated car sales dialog between virtual characters [26]. The
eShowroom consists of the following main components: a dialog planner, a
multimodal natural language generator, a text/concept-to-speech synthesis component,
and a gesture assignment module.

The dialog planner takes the role of a playwright that generates a script (a
sequence of dialog acts) for the virtual characters that become the actors in a
simulated car sales dialog. A dialog act represents a communicative function, such as
requesting some information (request), answering a question (confirm, inform), or
giving feedback (agree). Dialog strategies represent a certain sequence of dialog acts
(or moves) of the participating dialog partners as they can be typically observed in the
genre at hand. A sales dialog typically starts with a greeting phase, followed by the
customer’s request for information about a specific product. Subsequently, a
question-answer game between the customer and the salesperson develops where various
aspects of the product are discussed. Finally, the customer communicates a purchase
decision and, in a closing phase, the dialog ends.

The generated script that encompasses all dialog moves is then handed over to the
multimodal natural language generator, which transforms dialog act representations 
into text, annotated with syntactic, semantic, and pragmatic features. The component 
is also responsible for nonverbal behavior, such as selecting iconic or emblematic 
gestures. The task of the text/concept-to-speech synthesis is then to convey, through 
adequate voice quality and prosody, the intended meaning of the text as well as the 
emotion with which it is uttered. It also provides information on the exact timing of 
utterances, syllables and phonemes, which is indispensable for the gesture assignment 
module. This module is responsible for fine-tuning the synchronization of speech and 
gesture, including proper alignment of verbal and nonverbal output. 

For generating affective dialogs among virtual characters we make two extensions 
to our dialog planner. Firstly, we augment the dialog act representation so that it 
comprises an emotion slot. It will be instantiated by the most dominant emotion of the 
act performer (i.e., the speaker). 

Secondly, we need to interleave dialog planning with our mechanism for emotion
elicitation and computation. In our approach this coupling is done via the concept of a
current discourse state (CDS) as illustrated in Fig. 5. The current discourse state
depends on context information, such as knowledge concerning the dialog partners
and their actions, as well as events and objects in the world. When a dialog act is
performed, the CDS is updated. Since the dialog act may have an impact on the
characters’ emotions, a so-called Emotion Engine performs an update of the
characters’ affective states. In other words, we compute how a new situation
is appraised by the dialog partners. When the dialog planner determines the next
dialog move, i.e., speaker and dialog act, it will take into account the updated
affective states. In particular, the speaker’s most dominant emotion will be used as
an additional parameter for text generation, gesture alignment, and speech synthesis.
The performance of the dialog act will cause a next update of the CDS and in turn
change the affective states of the characters.
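
The coupling described above (perform a dialog act, update the CDS, let the Emotion Engine re-appraise, then plan the next move) can be sketched as a small loop. The class names, the `appraise` callback signature, and the additive intensity update capped at 1.0 are assumptions for illustration; they are not the actual NECA interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DialogAct:
    speaker: str
    act_type: str              # e.g. "request", "inform", "confirm", "agree"
    content: str
    emotion: str = "neutral"   # emotion slot: speaker's most dominant emotion


@dataclass
class DiscourseState:
    history: List[DialogAct] = field(default_factory=list)
    # character name -> {emotion name: intensity}
    affect: Dict[str, Dict[str, float]] = field(default_factory=dict)


def dominant_emotion(state: DiscourseState, character: str) -> str:
    emotions = state.affect.get(character, {})
    return max(emotions, key=emotions.get) if emotions else "neutral"


def perform(state: DiscourseState, act: DialogAct,
            appraise: Callable[[str, DialogAct], Dict[str, float]]) -> DialogAct:
    """Fill the emotion slot, update the CDS, then run the emotion update."""
    act.emotion = dominant_emotion(state, act.speaker)
    state.history.append(act)
    # Emotion Engine step: every character appraises the act individually.
    for character, emotions in state.affect.items():
        for emotion, delta in appraise(character, act).items():
            emotions[emotion] = min(1.0, emotions.get(emotion, 0.0) + delta)
    return act
```

Because the emotion slot is filled before the act is appraised, an act can only reflect emotions elicited by earlier moves, which matches the update order described in the text.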

For the purpose of illustration let us assume that a virtual salesperson has just
introduced a new car to the customer and said: “This is a very sporty car, it has 240
horsepower”. This new information about the existence of an object with this feature
will be appraised by the customer, either positively or negatively depending on the
customer’s attitudes towards sporty cars. If positively appraised, a liking emotion will
be elicited and, depending on its intensity, it may be reflected in the customer’s next
dialog move, e.g. “Wow, 240 horsepower, that’s wonderful!”

Somewhat more tricky are question-answering games. Suppose the customer 
would like to continue the dialog by posing a question about another car attribute. To 
this end, the repertoire of dialog strategies comprises strategies of the type 
“QuestionAnswer:Boolean” in which the customer requests information about a
boolean attribute, e.g. by asking whether the car has a catalytic converter. Depending
on the attribute value, the salesperson will confirm or disconfirm this.

Fig. 5. Coupling a Dialog Planner with an Emotion Engine via a Discourse Manager

In order to be able to reflect emotion in the query, e.g. through a modification of 
the voice quality, we need to anticipate the appraisal of the situation in which the car 
actually has the requested feature. Suppose the customer pays attention to 
consumption and environmental friendliness. The presence of a catalytic converter 
will then be appraised as highly desirable (D = 0.8). However, since at the current
state in the dialog this is not yet confirmed, the odds may be fifty-fifty, so that the EEC
variable Likelihood takes on the value 0.5 (L = 0.5). Based on these EECs a hope
emotion will be elicited with a relatively high intensity (hope = 0.82). Assuming that
this is the most dominant emotion in the current situation, the dialog planner will
assign the hope emotion to the request act. At the surface level, it may then be
reflected in the formulation of the query, e.g., by using the wording:

“It certainly does have a catalytic converter, doesn’t it?”
accompanied by a hopeful facial expression.



5 Interweaving Emotional Scripting and Dialog Generation - 
CrossTalk II 

As mentioned in Section 1, a peculiarity of the CrossTalk installation is that dialog
simulation relies partly on a plan-based generation approach and partly on a repertoire
of human-authored scenes. To design a new version of CrossTalk with affective
characters (CrossTalk II), we use the plan-based approach for affective dialog 
planning as sketched in Section 4. As in the original version of CrossTalk, a simple 
authoring syntax is meant to appeal to non-programmers, including professional 
playwrights, and scripts can be written with any text processing software. To be able 
to interweave authored and automatically generated scenes at runtime, a script 
compiler transforms authored scenes into the same internal format that is used by the 
dialog planner [3,4]. 

Since we now model emotions in our characters, we need to make sure that 
authored dialog contributions have an impact on the emotions, too. In principle, one 
could try to automatically extract emotion eliciting conditions for the characters by 
analyzing and interpreting the authored scene scripts. Another approach is to provide 
the authors with a convenient method to add meta-level information relevant to the 
eliciting of emotions. In CrossTalk II we follow this approach and provide scene 
authors with two kinds of tag sets: appraisal tags and dialog act tags, see [27]. 

5.1 Appraisal Tags 

Appraisal Tags (ATs) express how a speaking character appraises the event, action or
object about which it is talking. ATs serve as symbolic abbreviations for EECs to
relieve scene authors of the burden of specifying EEC variables. Using brackets to
indicate tags in a scene script, appraisal tags are inserted directly behind the
transcriptions of dialog contributions as shown in the example below:

R: The weather’s getting better. [=good_likely_future_event]

When processing a script, CrossTalk’s scene compiler maps ATs onto EECs of the
speaking character by using default settings for the EEC variables. For instance, the
appraisal tag in the example above is mapped to the EEC variables:

[D:= 0.5 (moderately desirable), P:= nil, A:= nil, L:= 0.5 (moderately likely)]
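
To illustrate the mapping from appraisal tags to default EECs, a minimal Python sketch follows. It is not the CrossTalk scene compiler: the function `compile_contribution`, the regular expression and the table layout are our own assumptions, and the table lists only the one tag quoted above.

```python
import re

# Default EEC settings per appraisal tag. Only the tag quoted in the text is
# listed; None plays the role of "nil".
AT_DEFAULTS = {
    "good_likely_future_event": {"D": 0.5, "P": None, "A": None, "L": 0.5},
}

# A tag is assumed to be the last element of a contribution: "... [=tag]".
TAG_PATTERN = re.compile(r"\[=(\w+)\]\s*$")


def compile_contribution(line):
    """Split 'Speaker: text [=tag]' into speaker, text and the speaker's EECs."""
    speaker, _, rest = line.partition(":")
    match = TAG_PATTERN.search(rest)
    if match is None:
        return speaker.strip(), rest.strip(), None
    text = rest[:match.start()].strip()
    eec = dict(AT_DEFAULTS[match.group(1)])   # copy so the defaults stay untouched
    return speaker.strip(), text, eec


speaker, text, eec = compile_contribution(
    "R: The weather's getting better. [=good_likely_future_event]")
```

An untagged contribution simply yields no EECs in this sketch, so authors only annotate the lines that should affect a character's emotions.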

5.2 Dialog Act Tags 

Dialog Act Tags (DATs) are somewhat more specific annotations than ATs. Their
function is to provide human scene authors with an intuitive means to indicate the
underlying communicative intent of an utterance. The tags are inserted behind
the utterance they refer to. The dialog act’s addressee(s) must be specified.

For example, Figure 6 shows the use of a dialog act tag in an authored scene.
The utterance of the character Ritchie is a verbal attack on the character Tina.
For CrossTalk II we have currently defined 23 DATs, and an addressee can be one of
the characters Tina (T), Ritchie (R), Cyberella (C), the user (U) or all conversation
partners (ALL).

As in the case of appraisal tags, the Scene Compiler maps DATs onto
EECs. However, in contrast to appraisal tags, which concern only the EECs of
speakers, DATs concern speaker, addressee, and in some special cases even further
characters not explicitly specified as addressees in the DAT. For instance, in
CrossTalk the character Cyberella has the role of a moderator. Therefore, if
the two actors Tina and Ritchie attack each other in front of the audience, she
may consider this an undesirable event, too. To relieve the script author
of the burden of tagging all dialog moves in full detail, however, the
author can define specific mapping rules for each DAT. Regarding the example in
Figure 6, the corresponding mapping of the dialog act tag attack is shown in Figure 7.
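
The mapping rule of Figure 7 can be read as a small function. The Python sketch below uses the EEC values given in that figure; the function name `map_attack` and the list-of-pairs return format are our own assumptions for illustration.

```python
def map_attack(speaker, addressee):
    """Return (character, EEC) pairs for an utterance tagged [=attack addressee]."""
    updates = [
        (addressee, {"D": -1.0, "P": -1.0}),   # very undesirable, very blameworthy
        (speaker, {"D": 0.5}),                 # moderately desirable
    ]
    if speaker != "Cyberella":
        # The moderator also appraises the attack negatively.
        updates.append(("Cyberella", {"D": -0.5, "P": -0.5}))
    return updates


updates = map_attack("Ritchie", "Tina")
```

A list of pairs is used rather than a dictionary so that a character who is both addressee and moderator would receive both EECs instead of one overwriting the other.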

5.3 Tagging User Feedback 

In the NECA eShowroom as well as in CrossTalk the user is primarily in the role of a
spectator who cannot participate in the conversations of the virtual characters.
CrossTalk, however, solicits user feedback on its performances. When the characters
Tina and Ritchie act out a car-sales performance, the user’s touch screen displays an
applause and a boo button that can be pressed at any time to indicate appreciation
or dislike. We treat such indications similarly to dialog acts that are directed to the
characters and that have a certain impact on their emotions. More technically
speaking, each button is associated with a dialog act tag, which is in turn mapped onto
EECs and eventually causes an update of the characters’ emotions. This way
CrossTalk characters react emotionally to user input. Thus, by giving frequent
feedback the visitor can influence the emotions of the characters and elicit emotional
responses. For instance, if a visitor keeps pressing the boo button (negative
feedback) for a certain period of time, Cyberella becomes more and more distressed.
Her emotional disposition may eventually change her dialog behavior. Depending on
the current state of a car-sales performance she may interrupt and suggest starting
again after the user has provided new parameter settings.

    Tina: I didn’t get the job for the MTV webpage. It went to some kid that
    looked like Britney Spears.
    Ritchie: Well, can you sing? [=attack T]

Fig. 6. Authored scene with a dialog act tag

    $Speaker $Text [=attack $Addressee] ::=
        THEN ADD $Addressee-EEC:
            D:= -1.0 (very undesirable)
            P:= -1.0 (very blameworthy)
        THEN ADD $Speaker-EEC:
            D:= +0.5 (moderately desirable)
        AND IF $Speaker NOT Cyberella
            THEN ADD Cyberella-EEC:
                D:= -0.5 (moderately undesirable)
                P:= -0.5 (moderately blameworthy)

Fig. 7. Example of mapping DAT to EECs
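
The effect of repeated user feedback on Cyberella described in Section 5.3 might be sketched as follows. The step size, the threshold and the function names are purely illustrative assumptions; in the system itself the buttons are mapped onto EECs and processed by the emotion machinery rather than by a single counter.

```python
BOO_STEP = 0.25             # assumed change in distress per button press
INTERRUPT_THRESHOLD = 0.75  # assumed level at which Cyberella interrupts


def apply_feedback(distress, button):
    """Return the new distress level after one press of 'boo' or 'applause'."""
    if button == "boo":
        return min(1.0, distress + BOO_STEP)
    if button == "applause":
        return max(0.0, distress - BOO_STEP)
    return distress


def should_interrupt(distress):
    """Cyberella suggests a restart once her distress is high enough."""
    return distress >= INTERRUPT_THRESHOLD


distress = 0.0
for _ in range(3):
    distress = apply_feedback(distress, "boo")
```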



6 Exploiting Affect in Dialog Generation 

There are many decision points in our applications at which knowledge about the
emotions of speakers and addressees can be taken into account. Focusing on
multi-party conversations (rather than performing physical actions), emotions can be used to
inform the selection of dialog strategies and linguistic style strategies as proposed by
[24]. They also play an important role in the turn-taking behavior (e.g. a spontaneous
barge-in may result from an intense emotion) and in the realization of concrete
dialog moves by selecting the dialog act type and the propositional content. For each
dialog act the system then needs to make a linguistic choice regarding wording and
surface forms. In the speech production system the emotions should be reflected in
the intonation and voice quality. In addition, the non-verbal behavior of both speaker 
and listeners needs to be considered. This includes gaze behavior, facial display of 
emotions, postures, gestures, and proxemics, i.e. maintaining interpersonal distances 
and interaction angles [28]. 

In the NECA eShowroom (Section 4) knowledge about the emotions of the dialog 
participants is used by the dialog planner, the multimodal natural language generator 
(M-NLG), and the speech synthesis module. The dialog planner evaluates the 
preconditions of the dialog strategies and selects the one that best matches the 
affective states of the characters. For instance, if the salesperson repeatedly says “I
don’t know.” when being asked by the customer about certain features, the customer
will get more and more frustrated. If the anger exceeds a certain threshold the dialog
planner interrupts the ongoing question-answer game and selects a new dialog
strategy that leads over to the closing phase. The M-NLG takes the emotion into
account when creating the surface realization of a dialog act. Depending on the 
speaker’s emotion, the dialog act “inform about interior” could be realized as “It has a 
cramped interior.” (disliking) versus “The interior is more spacious than you would 
expect.” (liking). The selection of communicative gestures and facial expressions is 
also informed by the specified emotion. For this purpose we map emotions to the 
available facial animations (morph targets) and use the emotion intensity to determine 
the intensity of the facial expression. Our speech synthesis module works with
so-called emotion dimensions, a simplified representation of the essential properties of
emotions [29]. In NECA we have defined a mapping from OCC emotion types and
intensities to the three emotion dimensions evaluation (positive vs. negative),
activation (active vs. passive), and power (dominant vs. submissive). The dimensional
approach to emotional speech synthesis uses rules to map any point in this
three-dimensional emotion space onto its acoustic correlates. The resulting synthesis system
allows the gradual build-up of emotions, the synthesis of non-extreme emotions, and
the change of emotional tone over time [30].
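
As a rough illustration of how an emotion can steer both strategy selection and surface realization, consider the following Python sketch. The two example sentences are the ones quoted above; the threshold value, the strategy and act names, and the function signatures are hypothetical and not taken from the NECA implementation.

```python
ANGER_THRESHOLD = 0.7   # assumed value; the text only speaks of "a certain threshold"

# Surface forms for one dialog act, keyed by the speaker's emotion.
REALIZATIONS = {
    ("inform_about_interior", "disliking"): "It has a cramped interior.",
    ("inform_about_interior", "liking"):
        "The interior is more spacious than you would expect.",
}


def select_strategy(current_strategy, customer_anger):
    """Leave the question-answer game once the customer is too angry."""
    if customer_anger > ANGER_THRESHOLD:
        return "closing_phase"
    return current_strategy


def realize(act, emotion, intensity):
    """Pick the wording and a facial expression scaled by emotion intensity."""
    return {
        "text": REALIZATIONS[(act, emotion)],
        "facial_animation": emotion,                         # name of the morph target
        "facial_intensity": max(0.0, min(1.0, intensity)),   # clamped to [0, 1]
    }


move = realize("inform_about_interior", "liking", 0.6)
```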

In the CrossTalk II system (Section 5) the emotions of the virtual actors are used 
by the presentation manager, the output renderer, and the speech synthesis module. 
CrossTalk’s central module, the presentation manager, is responsible for selecting
and executing scenes. Which scene is played next is defined in the scene flow. 
Emotions are used at runtime when making transitions (conditional branching) and 
when selecting between alternative scenes. For instance, when the user gives feedback 
during a performance, a feedback scene is played which depends on the type of 
feedback (applause or boo) and on the affective state of the two virtual actors. The 
presentation manager also contains a dialog planner, which generates scenes at 
runtime. It uses affective states in the same way as the NECA eShowroom.

The presentation manager forwards directions for virtual characters contained in 
scenes to the output renderer module. These directions are mapped to character 
actions and input for the speech synthesis. At this level, we use emotions to enrich the 
dialog contributions provided by the human author. We constantly change the 
characters’ facial expressions so that they reflect their emotions. This is done in the 
same way as in the eShowroom, i.e. by mapping emotions and their intensities onto the
available facial animations. In some cases gestures are triggered to reflect a character’s
emotions, e.g. if Ritchie is angry with Tina, he will automatically use a pointing
gesture at Tina when he makes a verbal attack.

In CrossTalk we use a commercial text-to-speech system for generating speech
output. It provides an interface for changing speech parameters like baseline pitch,
speed, and volume. Using rules of thumb and intuition we have established
relationships between emotions and speech parameters; for example, to express joy,
we raise the baseline pitch and the volume according to the emotion intensity [27].
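
The joy rule just mentioned can be sketched as a linear scaling of the two parameters. The gain factors and the function `joy_prosody` below are assumptions chosen for illustration only; they are not the rules of thumb used in CrossTalk and do not refer to the interface of any particular text-to-speech product.

```python
def clamp01(value):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, value))


def joy_prosody(baseline_pitch, volume, intensity,
                pitch_gain=0.2, volume_gain=0.3):
    """Raise pitch and volume in proportion to the intensity of joy.

    pitch_gain and volume_gain are assumed maximum relative increases.
    """
    i = clamp01(intensity)
    return (baseline_pitch * (1.0 + pitch_gain * i),
            volume * (1.0 + volume_gain * i))


pitch, loudness = joy_prosody(100.0, 1.0, 1.0)
```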



7 Discussion and Future Work 

In this paper we have reported on our approach to adding affect to simulated
conversations among virtual characters. We started off from two concrete systems that
generate dialogs, neither of which explicitly modeled emotions in the characters.
We first extended our characters towards affective virtual personalities along the lines 
of the OCC model. By means of the so-called NECA eShowroom system we showed 
how a plan-based approach to dialog generation can be coupled with the computation 
of emotions. In our second system, CrossTalk II, we further showed how to 
interweave plan-based generation with human authoring of individual scenes to 
produce coherent affective dialogs. To this end we have introduced special tag sets 
that allow human scene authors to make annotations that impact the emotions of 
scripted characters. Abstracting from details on the syntactical level, this approach 
shares similarities with proposals for character scripting languages comprising 
emotion tags, such as AML, APML, MPML, or VHML [31]. However, a peculiarity 
of our approach is that we use the same machinery for emotion elicitation and 
computation for both automated dialog generation and the execution of
human-authored scenes. This allows us to smoothly interweave both approaches at the
emotional level. The outcome of our work comprises two fully operable systems, the
CrossTalk II installation, and the NECA eShowroom. Still, there is much room for 
further improvements. 

We have made quite a number of simplifications regarding the implementation of 
OCC. We are aware of the fact that more elaborate OCC implementations for virtual
characters exist that we could have adopted in principle. We preferred to start with a 
more simplistic approach since in both of our applications we deal with multiple 
characters, and in the case of CrossTalk II we had to address the additional integration 
of plan-based generation with human authoring. Having two running systems at hand 
will allow us to rapidly implement refinements of the emotion models and test their 
impact on the quality of the emerging conversations. 

Another aspect not yet sufficiently addressed concerns the exploitation of
emotions in the generation of multi-modal dialog behaviors in a multi-party setting. In 
the context of the VirtualHuman project we will use knowledge about emotions more 
rigorously to refine decision making in our characters regarding action selection, 
dialog planning, and multimodal articulation. 

Finally, from a longer-term perspective it would be desirable to have users fully
participate in affective conversations. However, this requires much more research on 
reliable analysis and interpretation methods for processing spoken input and inferring 
the user’s affective state. 

Acknowledgements. This work has been built upon contributions from the
VirtualHuman project (www.virtual-human.org) funded by the German Ministry for
Education and Research and from the EU-funded IST NECA project.

Abstract. This paper describes the first running prototype of a system for
domain-oriented spoken conversation with life-like animated fairy tale author
Hans Christian Andersen. Following a brief description of the system
architecture, we present our approach to the highly interrelated issues of making
Andersen life-like, capable of domain-oriented conversation, and affective. The 
paper concludes with a brief report on the recently completed first user test. 



1 Introduction 

In recent years, animated interface agents have become a sub-specialty among 
developers of multimodal dialogue systems. The community is growing fast, as 
witnessed by, e.g., the large attendance at the 2003 Intelligent Virtual Agents 
workshop [9]. Basically, animated interface agents are characterised by the on-screen 
display of a more or less human-like animated face. Some animated interface agents 
are embodied as well, and distinction may be made between cartoon (face or 
embodied) and life-like (face or embodied) human interface agents, depending on the 
level of rendering realism. To most users, human animated interface agents are not 
very expressive without the use of output speech. The talking face of news reader 
Ananova illustrates a basic animated interface agent who speaks [1]. When today’s 
animated interface agents are interactive, this is primarily accomplished through 
spoken input and output, essentially turning the agent into a multimodal spoken 
dialogue system-cum-animated human output graphics. These agents are often called 
conversational interface agents despite the fact that the large majority of them are still 
task-oriented [5] and hence, as argued below, not conversational in an important sense 
of this term. Increasingly, future animated interface agents will be able to interpret 
both verbal and non-verbal input communication, they will gradually become more 
life-like, and they will go beyond the level of task-oriented systems. 

This paper describes the first prototype of a non-task-oriented, life-like animated
agent system which also understands 2D gesture input. The prototype has been
developed in the NICE (Natural Interactive Conversation for Edutainment) project [8].
Work in NICE aims to demonstrate English domain-oriented conversation with fairy
tale author Hans Christian Andersen (HCA) in his study and Swedish spoken
computer game-style interaction with some of his fairy tale characters in the adjacent fairy
tale world. The present paper addresses the former goal. We propose the term
domain-oriented conversation to designate a half-way post between task-oriented
spoken dialogue [4, 6], and Turing-test compliant conversation [10]. In
domain-oriented conversation, the system is able to conduct unconstrained conversation about
topics within its knowledge domain(s). The target users of the HCA system are kids
aged 10-18. The primary use setting is in museums and other public locations
where interactions are expected to have an average duration of 10-15 minutes.

In the following, Section 2 discusses an influential empirical generalisation on user
perception of agent life-likeness. Section 3 outlines the general architecture of the
HCA system and of the HCA character module. Section 4 describes our key
conversation strategies for making HCA a new kind of believable virtual semi-human.
Section 5 focuses on one particular such strategy, i.e. emotional HCA. Section 6 briefly
reports on the first user test of the HCA system. Section 7 concludes the paper.



2 The Need for New Classes of Agents 

Although we are still in the dark in most areas, empirical generalisations are
beginning to emerge from evaluations of interactive animated interface agents. One
finding is that there seems to exist a “user-system togetherness” problem zone
separating two generic classes of agents. Due to the primitive nature of their interactive
behaviour, some agents are so different from their human interlocutors that they are
(almost) invariably perceived as systems rather than humans. This class includes,
among others, simple task-oriented unimodal spoken dialogue systems speaking with
a “computer voice” [11], primitive cartoon-style agents and other not very life-like
agents [5]. However, as graphical life-likeness, conversational abilities, and/or
persona expressiveness improve, users appear to start forming unconscious expectations
to the effect that they are facing a system with human-like capabilities. If these
expectations are thwarted, as they mostly are with today’s interactive agents,
frustration results. The user actually believed they were together with another human but were not.
The message for interactive animated interface agent research seems to be to find
ways to safely pass beyond the problem zone by building interactive agents which no
longer frustrate their users but, rather, constitute entirely new kinds of believable
virtual semi-humans. Some of the means towards this goal are to endow interactive
agents not only with life-like graphical quality and domain-oriented conversation but
also with non-stereotypical personalities, personal agendas and consistent emotional
behaviour. Our aim is for HCA to become such a character, or agent.



3 The NICE HCA System 

Two important goals in developing the HCA system are to investigate (i) how to
successfully integrate spoken interaction with gesture input and non-verbal animated
character output, and (ii) the use of spoken conversation for education and
entertainment. The key goal, however, is to (iii) investigate non-task-oriented spoken
conversation in a potentially realistic application. Arguably, the achievement of those goals
requires a new kind of “self-reliant” animated conversational agent which no longer
causes user frustration (Section 2).

We have developed a first prototype of the HCA system which was tested with
target group users in January 2004. The prototype is running with simulated
recognition. The recogniser still needs to be trained on large amounts of language data from
non-native English speaking children and will be included in the second prototype.

Figure 3.1 shows the architecture of the HCA prototype which is described in more
detail in [3]. NISLab is responsible for HCA’s natural language understanding,
character modelling and response generation functionalities. The other components in
Figure 3.1 are being developed by other NICE project partners or are (based on) freeware
(gesture recognition, message broker and speech synthesis). The project partners are
TeliaSonera, Sweden, Liquid Media, Sweden, Scansoft, Belgium, and LIMSI, France.




Fig. 3.1. General NICE HCA system architecture. 

The focus in this paper is on the HCA character module which is responsible for 
conversation management. Figure 3.2 shows the architecture of this module. 



Fig. 3.2. HCA character module (CM). DA is domain agent. MD is mini-dialogue.

The character module is always in one of three output states, producing either
non-communicative action output when HCA is alone in his study, communicative
function output when HCA is listening, or paying attention, to a visitor’s contribution to
the conversation, or communicative action when HCA produces a conversational
contribution, cf. the mind-state agent in Figure 3.2 and Section 4.4.
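
The three output states can be summarised in a small enumeration. The selection function below is only a sketch: its two boolean inputs and the precedence between them are our own simplifying assumptions, not a description of the character module’s actual control logic.

```python
from enum import Enum


class OutputState(Enum):
    NON_COMMUNICATIVE_ACTION = "non-communicative action"
    COMMUNICATIVE_FUNCTION = "communicative function"
    COMMUNICATIVE_ACTION = "communicative action"


def output_state(hca_is_contributing, visitor_is_contributing):
    """Choose the output state from two assumed boolean signals."""
    if hca_is_contributing:
        return OutputState.COMMUNICATIVE_ACTION      # HCA produces a contribution
    if visitor_is_contributing:
        return OutputState.COMMUNICATIVE_FUNCTION    # HCA listens, pays attention
    return OutputState.NON_COMMUNICATIVE_ACTION      # HCA is alone in his study


state = output_state(hca_is_contributing=False, visitor_is_contributing=True)
```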

The mind-state agent generates HCA’s conversational contributions, managing
HCA’s conversational agenda, interpreting the user’s spoken and/or gesture input in
context, deciding on conversation initiative, and planning HCA’s verbal and
non-verbal output. The conversational intention planner applies HCA’s conversational
agenda to the user’s current input and keeps track of agenda achievement (see Section
4.5). Six domain agents (DAs), one per knowledge domain, take care of
domain-specific reasoning, including meta-communication, and user model maintenance
(Section 4.3). The emotion calculator updates HCA’s emotional state (Section 5).

Mind-state agent processing is supported by three additional modules. The 
conversation history stores a representation of the emerging discourse context for 
consultation by other mind-state agent modules. The knowledge base maintains the 
system’s ontology, including references to HCA output. Finally, the finite-state 
machine mini-dialogue (MD) processor processes all user-HCA mini-dialogues, i.e. 
predefined small dialogues of the kind familiar from task-oriented systems. The 
output references retrieved from the knowledge base are sent to response generation 
via the mind-state agent manager and the character module manager.



4 Life-Likeness and Conversation 

In view of the discussion in Section 2, virtual HCA should not, on the one hand, pose
as the real HCA, nor, on the other, should the character be trapped in the
“togetherness” problem zone in which interactive agents frustrate their users. To address this
challenge, the first HCA prototype uses strategies such as the following: (i) a cover
story, (ii) life-like output graphics, (iii) life-like domains of discourse, (iv) life-like
in-and-out-of-conversation behaviour, (v) a conversation agenda, (vi) conversational
principles, (vii) error handling, and (viii) emotional behaviour. Strategies (i) through (vii) are
discussed in the present section; (viii) forms the subject of Section 5.



4.1 Cover Story 

The cover story for HCA’s limited knowledge about his domains of conversation is 
that HCA is coming back! However, he still has to re-learn much of what he once 
knew. If the user would do him the favour of visiting him later, he is convinced that 
he will have become much more of what he once was. In addition to the very true 
information provided by this cover story, the story may help convince users that HCA 
is not (yet) a full virtual person. It may be added that HCA does not tell the cover 
story up front to new users. Rather, users are likely to come across the cover story if 
they either explicitly ask what HCA knows about, or can do, or if they show too much 
interest in things he does not know about (yet). 



4.2 Life-Like Output Graphics 

The HCA computer graphics have been developed by Swedish computer games
company Liquid Media. Figure 4.1 shows 55-year-old HCA surrounded by artefacts
in his study. Users can use gesture and speech to indicate an artefact which HCA
might want to tell a story about. The study is a rendering of HCA’s study on display
in Copenhagen, modified so that he can walk around freely and so that a pair of doors
leads into the fairy tale games world (cf. Section 1). Also, pictures clearly relating to
HCA’s knowledge domains have been hung on the walls.




Fig. 4.1. HCA in his study. 



4.3 Life-Like Domains of Discourse 

Development of domain-oriented conversation requires selection of one or several 
knowledge domains for the character. In the first NICE prototype, HCA’s knowledge 
domains are: his fairy tales (works), his childhood in Odense (life), his physical 
presence in his study (presence), getting information about the user (user), his role as 
“gatekeeper” for the fairy tale games world (gatekeeper), and the “meta” domain of 
resolving problems of miscommunication (meta). These domains are probably those 
which most users would expect anyway. 

Since we want to explore the domain development challenges “breadth-first” in 
order to investigate, among other things, how to handle potential cross-domain, 
multiple-domain, and super-domain issues, none of those domains have been
developed to their full depth in the first prototype. For instance, HCA only has in-depth
knowledge of three of his most famous fairy tales, the Little Mermaid, the Ugly 
Duckling, and the Princess and the Pea. If a user asks about some other fairy tale, the 
user is told some version of HCA’s cover story. 

HCA has two mechanisms for in-depth conversation. The fairy tales are stored in 
template-style fashion in the knowledge base, enabling HCA to tell stories about, e.g., 
the main character in some fairy tale or the moral of a particular fairy tale.
Mini-dialogues are used for structured, in-depth conversation about some topic, such as
game-playing. HCA will show interest in games played by kids and adolescents today 
and he understands terms for games he is interested in, such as ‘computer games’ and 
‘football’. HCA also conducts a mini-dialogue in order to gather knowledge about the 
present user. The knowledge HCA collects about the user is stored in the user model
for use during conversation (Figure 3.2).

4.4 Life-Like in-and-out-of-Conversation Behaviour

As shown in Figure 3.2, HCA behaves as a human dedicated to fairy tale authoring 
both when he is alone in his study, when paying attention to the user’s speech and/or 
gesture input, and when producing natural interactive output. In the
non-communicative action output state, HCA goes about his work in his study as displayed through a
quasi-endless loop of micro-varied behaviours. We cannot have him walk around on
his own yet, however, because he may walk through the walls and crash the system
due to the not-fully-debugged graphics rendering. In the communicative function
output state, HCA pays attention to the user’s speech and/or gesture input through
conversational recipient behaviours, such as looking at the user, nodding, etc. For this 
to happen in real time, the character module will soon have fast-track connections to 
the speech recogniser and the gesture recogniser to be able to act as soon as one or 
both of them receive input. In the communicative action output state, HCA responds 
to input through verbal and non-verbal communicative action [2]. 



4.5 Conversation Agenda 

HCA follows his own agenda during conversation. The agenda reflects his personal 
interests, e.g. his interest in collecting knowledge about the user and in having a good 
long conversation with motivated users. The agenda ensures some amount of 
conversational continuity on HCA’s part, making sure that a domain is pretty 
thoroughly addressed before moving to another, unless the user changes domain and 
is allowed to do so by HCA. HCA ensures this by keeping track of what has been
addressed in each domain so far, which also helps him avoid repeating himself. Also,
since many users are likely to leave HCA’s study when learning that the double doors 
lead to the fairy tale world, HCA is reluctant to embark on the “gatekeeper” domain 
until the other domains have been addressed to quite some extent. If a user embarks 
on “gatekeeper” too early, HCA changes the domain of conversation. 
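
The reluctance to enter the “gatekeeper” domain can be sketched as a simple check on how far the other domains have been covered. The domain names follow Section 4.3; the coverage measure, the 0.5 threshold, the choice of which domains count, and the function name are assumptions made for this illustration only.

```python
# Domains that should be addressed before "gatekeeper" (assumed selection;
# the "meta" domain is left out because it only resolves miscommunication).
CONTENT_DOMAINS = ("works", "life", "presence", "user")


def allow_domain(requested, coverage, min_coverage=0.5):
    """Decide whether HCA follows the user into the requested domain.

    coverage maps a domain name to the fraction of its topics addressed so far.
    """
    if requested != "gatekeeper":
        return True
    return all(coverage.get(domain, 0.0) >= min_coverage
               for domain in CONTENT_DOMAINS)


early = allow_domain("gatekeeper", {"works": 0.1, "life": 0.0})
late = allow_domain("gatekeeper",
                    {"works": 0.6, "life": 0.5, "presence": 0.7, "user": 0.9})
```

When the check fails, HCA would change the domain of conversation rather than talk about the doors to the fairy tale world.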



4.6 Conversational Principles 

Conversation, properly so-called, is very different from task-oriented dialogue. In 
addressing conversation, the seasoned spoken dialogue designer discovers the absence 
of the comforting and richly constraining limitations imposed by the inherent logic 
and combinatorics of dialogue about some particular task. Instead, the developer finds 
a different, and often contrary or even contradictory, richness which is that of spoken 
conversation. HCA follows a set of principles for successful, prototypical human- 
human conversation which we have developed for the purpose in the apparent 
absence of an authoritative account in the literature which could lend itself to easy 
adaptation for our purposes. The principles are: 

1. initially, in a polite and friendly way, the interlocutors search for common ground,
such as basic personal information, shared interests, shared knowledge, and
similarity of character and personality, to be pursued in the conversation;

2. the conversation is successful to the extent that the interlocutors find enough
common ground to want to continue the conversation;

3. the interlocutors provide, by and large, symmetrical contributions to the 
conversation, for instance by taking turns in acting as experts in different domains 
of common interest, so that one partner does not end up in the role of passive 
hearer/spectator, like, e.g., the novice who is being educated by the other(s); 

4. to a significant extent, the conversation is characterised by the participants taking 
turns in telling stories, such as anecdotes, descriptions of items within their 
domains of expertise, jokes, etc.; 

5. conversation is rhapsodic, i.e. highly tolerant to digression, the introduction of 
new topics before the current topic has been exhausted, etc.; and 

6. conversation, when successful, leaves the partners with a sense that it has been 
worthwhile. 

The reader may have noted that the above list does not mention entertainment at all, 
despite the fact that the HCA system has an edutainment goal. This is partly because 
we assume that successful conversation is itself entertaining and partly because we 
want to focus on computer gaming-style entertainment in the second HCA prototype. 

The ways in which HCA pursues the principles listed above are the following. He 
assumes, of course, that the user is interested in his life and fairy tales (1,2). However, 
he is aware that common ground not only has an HCA aspect but also a user aspect. 
He therefore tries to elicit user opinions on his fairy tales, on his visible self and on 
his study. However, he also tries to make the user the expert (3) by asking about 
games played by children and adolescents today, demonstrating interest in football, 
computers, and the like. During Wizard of Oz collection of 30 hours and approx. 500 
spoken conversations with young users in the summer of 2003, we found that the 
users had strong interest in telling HCA about contemporary game-playing and also 
about technical inventions made after HCA’s times. HCA himself, in turn, does not 
just answer questions, or ask them, but tells stories - about his life, about his fairy 
tales, about wall pictures in his room, etc. (3,4). 

HCA’s main problem seems to be that he cannot always pursue in depth a topic 
launched by his interlocutor because, at this stage of development, at least, his 
knowledge and conversational skills are still somewhat limited, and we do not have 
sufficient information about the key interest zones of his target audience. This is 
where the rhapsodic nature of conversation (5) may come to his rescue to some 
extent. When, during conversation, and despite his following an agenda in conver- 
sation, HCA is lost and repeatedly does not understand what the user is saying, he 
changes topic or even domain in order to recover conversational control. 

Analysis of data from the user test of the system will, we hope, provide substantial 
information on the extent to which our implementation of the conversational strate- 
gies described above promise to achieve domain-oriented conversation, including 
evidence on whether the conversation is considered worthwhile by the users (6). 



4.7 Error Handling 

Error handling meta-communication is still rather primitive in the first HCA 
prototype. We have considered four types of user-initiated meta-communication, i.e. 
clarification, correction, repetition, and insult, and four types of system-initiated 
meta-communication, i.e. clarification, repetition, “kukkasse”, and start of conversation. 

User clarification is not handled in PT1. This is generally hard to do and we do not 
really know which kinds of clarification may occur. Thus, we have decided to wait for 
PT1 evaluation data before taking action. User correction is not treated as 
meta-communication in PT1 but is handled as new input to which HCA will reply if he can. 
The user can ask for repetition and get the latest output repeated. The user may also 
insult HCA. In this case, HCA will react emotionally and provide rather rude verbal 
output. Repetition and insult are handled by the meta domain agent. 

HCA clarification is only handled to a limited extent in some of the mini-dialogues 
in PT1. PT2 is expected to allow direct clarification questions, e.g., concerning which 
picture the user pointed to. When HCA does not understand what the user said (low 
confidence score), he will ask for repetition or otherwise indicate that the input was 
not understood. HCA has various ways of expressing this, depending on how many 
times in succession he has not been able to understand what the user said. He also has 
a rhapsodic escape option from this situation, which is to jump to something comple- 
tely different. To this end, he has a so-called “kukkasse” which is a collection of 
phrases that, quite obviously, are out of context, e.g. “In China, as you know, the 
emperor is Chinese” or “Do you think that my nose is too big?”. The hypothesis is 
that such rhapsodic phrases will make the user change topic instead of trying to re- 
express something which HCA cannot understand. If no conversation is going on but 
HCA receives spoken input with a low confidence score, he will address the potential 
user to find out if a new conversation is starting by saying, e.g., “Would you like a 
chat with me?”. Asking for repetition, “kukkasse”, and figuring out if a conversation 
is starting are all handled by the meta domain agent. 
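The escalation just described for low-confidence speech input can be sketched as follows. This is a minimal illustration only, not the meta domain agent itself: the function name `meta_action`, the confidence threshold, the number of repetition requests allowed before a “kukkasse” phrase is used, and the resetting of the failure counter are all assumptions; only the three quoted phrases come from the text.

```python
import random

# The two "kukkasse" phrases quoted in the text above.
KUKKASSE = [
    "In China, as you know, the emperor is Chinese",
    "Do you think that my nose is too big?",
]


def meta_action(confidence, failures_in_a_row, conversation_active,
                min_confidence=0.5, max_repeats=2, rng=random):
    """Return (action, utterance_or_None, new_failure_count).

    min_confidence and max_repeats are hypothetical values chosen
    for illustration; the text does not state them.
    """
    if confidence >= min_confidence:
        # Input understood: pass it on and reset the failure counter.
        return ("handle_input", None, 0)
    if not conversation_active:
        # Low-confidence input while idle: check for a new user.
        return ("start_conversation", "Would you like a chat with me?", 0)
    failures = failures_in_a_row + 1
    if failures <= max_repeats:
        # Wording would vary with the number of failures in succession.
        return ("ask_repetition", None, failures)
    # Rhapsodic escape: say something obviously out of context.
    return ("kukkasse", rng.choice(KUKKASSE), 0)
```

Low-confidence gesture input is deliberately left out of the sketch, since the system simply does not react to it.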

Finally, it may be mentioned that, if the system receives low confidence score 
gesture input, HCA does not react. This is to avoid inappropriate system behaviour in 
cases when a user is fiddling with the gesture input device (mouse or touchscreen). 



5 HCA’s Emotional Life 

Endowing HCA with emulated emotions serves two purposes. The first purpose is to 
add to his fundamental human-like features, the second, to make conversation with 
him more entertaining, due, for instance, to the occasional eruption of extreme 
emotional behaviour. This section describes HCA’s current emotional life. 



5.1 Modelling Emotions 

HCA has the simple emotional state space model shown in Figure 5.1. His default 
emotional state is friendly, which is how he welcomes a new user. During conver- 
sation, his emotional state may shift towards happiness, sadness, anger, or a mixture 
of anger and sadness. At any time, his current emotional state is represented as ES: 
[h:, s:, a: ]. Each attribute has a value between 0 and 10. If h (happiness) is non-zero, 
s (sadness) and a (anger) are zero. If s and/or a are non-zero, h is zero. The default 
friendly state is ES: [h: 0, s: 0, a: 0]. 
Fig. 5.1. HCA’s emotional state space. 



5.2 Eliciting Emotional Change 

HCA’s emotional state changes as a function of the user’s input, for instance if the 
user insults HCA, wants to know his age, or shows a keen interest in the Ugly 
Duckling. Emotional changes caused by input semantics are identified in the knowledge 
base by domain agents. Emotional changes are called emotion increments and are 
represented as EI: [h: , s: , a: ]. Increment values range from 1 to 10 and only a single 
emotional attribute is incremented per emotion increment. Each time an emotion 
increment is identified, it is sent to the emotion calculator (Figure 3.2) which updates 
and returns HCA’s emotional state. As in humans, the strength of HCA’s non-default 
emotions decreases over time. Thus, for each user input which does not elicit any 
emotion increments, and as long as HCA’s emotional state is different from the default 
ES: [h: 0, s: 0, a: 0], the state converges towards the default by (1h) or (1s+1a). 
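The state representation and its decay towards the default can be illustrated with a short sketch. It assumes that the convergence step lowers happiness by one unit, or sadness and anger by one unit each, per input that elicits no increment; the class and function names are hypothetical, and the emotion calculator's handling of increments is not modelled because the text does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionalState:
    """ES: [h, s, a], each between 0 and 10; default is all zeros."""
    h: int = 0  # happiness
    s: int = 0  # sadness
    a: int = 0  # anger

    def __post_init__(self):
        for value in (self.h, self.s, self.a):
            if not 0 <= value <= 10:
                raise ValueError("attribute values must lie in 0..10")
        # Happiness excludes sadness and anger, and vice versa.
        if self.h and (self.s or self.a):
            raise ValueError("h must be zero when s or a is non-zero")


def decay(state):
    """One decay step for an input without emotion increments.

    Assumed reading: each non-zero attribute drops by one unit,
    never below zero.
    """
    return EmotionalState(
        h=max(0, state.h - 1),
        s=max(0, state.s - 1),
        a=max(0, state.a - 1),
    )
```

For example, under these assumptions a state of [h: 0, s: 1, a: 4] decays to [h: 0, s: 0, a: 3] after one neutral user input.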



5.3 Expressing Emotion 

HCA expresses his emotional state verbally and non-verbally. A threshold function is 
applied for selecting knowledge base output according to HCA’s current emotional 
state. In the friendly core (+/-6) area of happiness, sadness, and anger values, he 
expresses himself in a friendly manner. Beyond those values, and so far to a limited 
extent only, he expresses himself in a pronounced happy, sad, or angry manner. 
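A threshold function of this kind might look as follows. The sketch assumes that the friendly core includes the value 6 itself and that anger takes precedence over sadness when both exceed the core; neither detail is stated in the text, and the function name is hypothetical.

```python
def expression_mode(h, s, a, core=6):
    """Select the output style for the current emotional state.

    Values up to and including `core` count as the friendly core
    (assumed inclusive boundary).
    """
    if max(h, s, a) <= core:
        return "friendly"
    if h > core:
        return "happy"
    # Assumed tie-break: anger wins over sadness when equally strong.
    return "angry" if a >= s else "sad"
```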



5.4 Challenges for the Second HCA Prototype 

As described above, HCA’s first-prototype emotional characteristics include: a four- 
emotion state space, ability to react emotionally to input, and emotional state-depen- 
dent verbal and non-verbal output. Obviously, we need to evaluate those emotional 
characteristics as part of the user evaluation of PT1, before making strong design 
decisions concerning emotion in the second prototype. However, we have identified 
several potential improvements in emotional behaviour which are candidates for PT2 
implementation. These are described in the following. 
First, HCA may need a more articulate emotional state space in PT2. However, 
compared to the points below, and despite the fact that more complex sets of emotions 
abound in the literature [9], this is not a top priority. As long as HCA’s mechanisms 
for emotion expression are strongly limited, there does not seem to be sufficient 
reason for endowing him with a richer internal emotional state space. 

Secondly, we would like to experiment with ways of systematically modifying 
verbalisation as a function of emotional state, for instance by using emotion tags for 
modifying HCA’s verbal conversational contributions on-line. 

Thirdly, we hope to be able to fine-tune HCA’s non-verbal expression of emotion 
to a far greater extent than in the first prototype. One way of doing this is to use his 
current emotional state to modulate the non-verbal behaviour parameters amplitude 
and speed. Thus, HCA would, e.g., smile more broadly or gesture more widely the 
more happy he is, gesture faster the more angry he is, and act and communicate more 
slowly the more sad he is. A second approach, compatible with the one just men- 
tioned, is to use rules for adding or deleting emotion tags in the response generator as 
a function of the current emotional state. This approach may also involve a layered 
model of non-verbal behaviour, so that basic posture is modified as a function of 
emotional state prior to modifying all posture-based non-verbal expressions. 

Fourthly, it is a well-known fact that humans sometimes change at least some of 
their emotions by expressing them. For instance, when a person expresses anger, the 
anger sometimes diminishes as a result. In such cases, although the expression of 
anger is a function of the user’s input, the reduced anger is not a function of the input 
but, rather, a function of the actual expression of the anger. We would like to have 
HCA do the same, cf. [7]. 



6 User Test 

The first HCA prototype was tested at NISLab in January 2004 with 18 users, nine 
boys and nine girls, from the target user group of 10-18 year old children and 
teenagers. The users’ spoken input was fast-typed, whereupon the system did the rest. 
This resulted in approximately 11 hours of audio, video, and logfile-recorded 
interaction and 18 sets of structured interview notes. Each user session had a duration 
of 60-75 minutes. A test session included conversation with HCA in two different 
conditions followed by a post-test interview. In the first condition, the users only 
received basic instructions on how to operate the system, i.e. speak using the headset, 
control HCA’s movement, control the four camera angles, and gesture using mouse or 
touchscreen. After 15 minutes the session was interrupted and the user received a set 
of thirteen typed scenario problems to be solved through speech or gesture input in 
the second session, such as “Find out if HCA has a preferred fairy tale and what it is” 
and “Tell HCA about games you like or know”. The problems might be addressed in 
any order and the user was not necessarily expected to carry out all of them. The 
purpose was to ensure a significant amount of user initiative to explore how the 
system would respond under the resulting pressure. 
Figure 6.1 shows three excerpts from the user test conversations, illustrating, 
among other things, HCA’s story-telling strategy, “Kukkasse” evasions (*), user 
gesture input (**), and what may be an overly temperamental HCA response (***). 



HCA: If you know it what do you think of 
the character the little mermaid? 

U: she is very beautiful 

HCA: Can you tell me about the fairy tale 
The Little Mermaid? 

U: there was a little mermaid who could not 
wait till she turned fifteen so she could get 
up of the water 

HCA: What do you think of the prince the 
character from the fairy tale The Little 
Mermaid? 



U: she is very very curious 

HCA: Can you tell me about the prince the 
character from the fairy tale The Little 
Mermaid? 

U: he is a young prince who sees the little 
mermaid and falls in love with her 
HCA: I hear that the present crown prince 
Frederik has now found his princess. What 
do you think about it? 

U: I think it is very good for him and I think 
mary will be a good queen 



U: what is your favourite fairy tale 
HCA: I like several of my fairy tales very 
much. One of my favourites is the ugly 
duckling. The ugly duckling is not accepted 
by the other ducks and animals because it 
looks different. It has a hard time of 
suffering and troubles. Eventually it wishes 
only to die till it discovers that it is in fact a 
beautiful swan. The morale of the fairy tale 
is that if you are different then you will get 
bullied. 



U: how old are you 

HCA: I am fifty-five years old. And I am in 
a good shape occasionally I ride on horses 
and go for long walks and when I travel I 
even sometimes hike. 

U: what is your interests 
*HCA: Did I already tell you that only the 
bible has been translated into more 
languages than my fairy tales. 

U: really 



**U: picturejennylind 

HCA: This is a picture showing the Swedish 
nightingale Jenny Lind. She sang beautifully 
and is one of the most beautiful women I 
have ever met I fell very much in love with 
her. Unfortunately she didn't fall in love with 
me but she became a dear friend of mine a 
dear friend. You should try to point at the 
things in my study. I like to tell my guests 
about some of them. 

U: stop it 

***HCA: Don't piss me off! 

U: do you have anything to tell me about 
these two 

HCA: Sometimes I write fairy tales while I 
am travelling but I always finish the writing 
here at home in my study. 



Fig. 6.1. Three conversation snippets from the user test. 



The data collected in the first HCA system user test is being analysed at the time of 
writing. Based on the analysis made so far it seems fair to conclude that, overall, the 
prototype was received remarkably well by the target users. As predicted, the tested 
system version performed significantly better in the first condition, in which HCA 
faced less aggressive and totally novice users, 15 of whom had never spoken to a system 
before. In the second condition, in which all users had a tall scenario-based agenda of 
their own, the system’s limitations in handling sudden and frequent user-initiated 
domain and topic shifts and out-of-domain input became more evident, providing 
invaluable data for improving HCA’s conversational abilities. 
7 Conclusion and Future Work 

We have described the recently completed first prototype of a life-like embodied 
domain-oriented conversational agent system. Following brief presentation of the sys- 
tem architecture, we focused on the strategies employed for making fairy tale author 
H. C. Andersen represent a new breed of believable embodied conversational charac- 
ters who, like their taciturn professional computer game counterparts but more so, 
help define the notion of a virtual person rather than emulate real persons. We briefly 
described the recent target user test of the prototype. Ongoing work includes in-depth 
analysis of the collected data, addition of speech recognition to the system, and, based 
on these steps, requirements and design specification of the second HCA system pro- 
totype with particular emphasis on conversational smoothness and flexibility. 

Acknowledgement. The work presented is being supported by the EU Human 
Language Technologies programme under contract IST-2001-35293. We gratefully 
acknowledge the support. 
Abstract. We describe an implemented system for the simulation and 
visualisation of the emotional state of a multimodal conversational agent called 
Max. The focus of the presented work lies on modeling a coherent course of 
emotions over time. The basic idea of the underlying emotion system is the 
linkage of two interrelated psychological concepts: an emotion axis - 
representing short-time system states - and an orthogonal mood axis that stands 
for an undirected, longer lasting system state. A third axis was added to realize 
a dimension of boredom. To enhance the believability and lifelikeness of Max, 
the emotion system has been integrated in the agent’s architecture. As a result, 
Max’s facial expression, gesture, speech, and secondary behaviors as well as his 
cognitive functions are modulated by the emotional system that, in turn, is 
affected by information arising at various levels within the agent’s architecture. 



1 Introduction 

Natural and intuitive communication with a computer is a primary research goal in 
human-computer interaction. In recent times this goal has frequently led to the 
employment of humanoid agents as interlocutors that are able to both understand and 
use communicative means natural to human users, i.e., speech, gesture, gaze, and 
facial expression. In our lab an anthropomorphic humanoid agent named Max is being 
developed [7]. In order to enhance his lifelikeness and believability, we aim at 
creating the impression of coherent emotions, moods and personality. This requires, 
first, appraising internal or external events w.r.t. their influence on the agent’s 
emotion dynamics, secondly, modeling the emotions and moods of the agent and their 
course over time as well as their mutual interaction (emotion dynamics), and, finally, 
communicating the emotional state in identifiable ways. In this paper, we focus on the 
last two aspects. The first aspect - the appraisal of emotions - may be accomplished 
in at least two ways by other components of the overall architecture: (1) on a non- 
cognitive level by direct sensory feedback in a VR application which is not described 
here, and (2) on a cognitive level by a BDI interpreter which performs deliberative 
reasoning. 

Based on original ideas of Wundt [18] and subsequent research, e.g., [10], the 
communication of affect can be seen as having three major dimensions of connotative 
meaning: pleasure, arousal, and dominance (PAD). Further evidence has been found 
that emotions can be differentiated from moods by at least two aspects [8]: an 
emotion can usually be associated with its eliciting stimulus and is a short-lived 
psychological phenomenon, whereas a mood is a more diffuse and longer lasting 
phenomenon. Nevertheless it is assumed that moods are fortified and alleviated in 
some way by emotions [8] as well as that the elicitation of emotions is influenced by 
the actual mood in return; see [12]. Mood can also be understood as “an emotional 
state, perhaps of low intensity, capable of lasting for many minutes or several hours”; 
see [12]. 

We believe that both theoretical aspects, emotion and mood, along with their 
effects on each other, must be accounted for by a model that is to simulate 
coherent emotion dynamics for a human-like character. We thus developed a 
model that combines these two theories to realize an internal continuous dynamics 
and allows for symbolic output of categorical emotional terms. Furthermore, our 
model extends these theories by adding a concept of boredom that is understood as a 
state of relatively low arousal, which is attributed to an inadequately stimulating 
situation as proposed by [11]. This concept is especially useful as Max is situated in 
scenarios of real-time communication with a varying degree of interaction and the 
concept of boredom supports the emergence of proactive behavior in a natural way. 

As far as the expression of emotions is concerned, two different kinds of 
information processing are distinguished: On the one hand, discrete emotional terms, 
modulated by a continuous intensity value, are used in order to trigger emotional 
expressions of the face as well as to influence deliberative reasoning in the cognitive 
architecture. On the other hand, involuntary facets of Max’s observable behavior - 
e.g., his simulated breathing rate, eye blink frequency, and speech pitch - are 
modulated by the continuous part of the emotional data like the intensity of arousal. 
Together with the capabilities of deliberative reasoning, nonverbal signals like 
gestures and facial expression, and response to various forms of user interaction (e.g., 
speech, gesture and gaze of the user), we expect Max to be perceived as a more 
credible conversational partner. 

After discussing related work in the following section, the emotion system is 
described in detail in Section 3. The integration of the emotion system in the cognitive 
architecture and the expression of affect are presented in Section 4. 



2 Related Work 

Several architectures were proposed for human-computer interfaces that account for 
affective states. Some have been implemented in expressive characters which are able 
to show their affective states via different communication channels such as voice 
quality and facial expressions, and which include such states in their deliberation 
process for action planning. Most of the existing computational systems of emotions 
rely on appraisal theories and logical reasoning about the eliciting factors of emotions. 
The majority of these systems are based on the OCC model by Ortony, Clore and 
Collins [13], see [6, 14, 16, 1] for examples. They either try to simulate a different 
number of discrete internal emotional categories, or aim at the expressive aspect of 
emotions and therefore simulate only a smaller number of the emotion categories than 
established by the OCC model (see [17]). In [4] a “personality system” is presented 
which includes “emotional states” based on the OCC model and also uses a concept 
called “mood” to mediate between the static personality and the dynamic emotions of 
a virtual human. The notion of an “emotional impulse” as introduced in [4] is 
appropriate for our emotion system as well, but in contrast to their approach, our 
emotion system does not need an associated emotion category as an input signal. 

Especially the course of emotions in time as well as the mutual interaction of 
different emotion categories (i.e. the emotion dynamics) was found difficult to realize 
in purely cognitively motivated architectures of emotions (see [14] for details). De 
Rosis et al. [15] focus on the dynamics of their agent’s affective states in a 
conversational situation. In contrast to their approach, which concentrates on event- 
driven emotions of the OCC model, our system is not designed to represent a mixture 
of multiple emotions so far. But with respect to the dynamics of emotions and their 
response decay in the absence of new specific stimuli they deal with the same 
conceptual problems as in our case. Addressing these problems, it appears a 
promising approach to ground a model of emotions on dimensional theories, as 
theoretically founded by [18, 10]. This is true particularly when focusing on the 
expression of emotions and their course over time (see [2] for example) because they 
are self-contained and complete, even if their expressive power is in some way 
restricted. 



3 Internal Structure of the Emotion System 

The emotion system in our approach contains two conceptually different components 
(see Fig. 1). The first component is mainly concerned with the course of emotions and 
moods over time and the mutual interaction between emotions and moods, whereas in 
the second component the categorization on discrete emotional terms is accomplished 
after a mapping into the PAD space. 



3.1 Emotions and Moods over Time and Their Mutual Interaction 

The concept of emotions is linked to the concept of moods using a two-dimensional 
space defined by an x-axis of emotional valence and an orthogonal y-axis that 
represents the valence of moods (see Fig. 2). The system tends to hold both valences 
in absolute zero because this point is interpreted as the prevalent state of mental 
balance. Therefore two independent spiral springs are simulated, one for each axis, 
which create two reset forces Fx and Fy whenever the point of reference is displaced 
from the origin. 
Fig. 1. Internal structure of the emotion system 



The exerted forces are proportional to the values of the corresponding valences x and y 
just as if the simulated spiral springs were anchored in the origin and attached to the 
point of reference. The mass-spring model was chosen here mainly based on the 
heuristics that it better mimics the time course of emotions than linear and exponential 
decreasing models. 

By adjusting the two spring constants dx and dy as well as the simulated inertial 
mass m of the point of reference, the course over time of both concepts can be biased 
intuitively. These parameters can also be construed as personality-related aspects of 
the emotion system. 



[Figure panel: 1. Dynamic component] 

Fig. 2. The linkage of emotions and moods and their courses over time 
In order to simulate the alleviating and fortifying effects of emotions on moods, the 
emotional valence is interpreted as a gradient for changing the valence of moods at 
every simulation step according to Eq. 1. The independent parameter a is a 
personality-related aspect of the character, with smaller values of a resulting in a 
more sluggish agent and greater values of a leading to a more moody agent. 



$$\Delta y = a \cdot x$$ 

Equation 1. The influence of emotions (x) on moods (y) 
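One simulation step of the dynamic component might be sketched as below. This is an illustration under stated assumptions rather than the implemented system: the integration scheme (semi-implicit Euler), the step size, all parameter values, and the reading of Eq. 1 as a per-step change of the mood valence are assumptions, and the names `DynamicState` and `step` are hypothetical. No damping term is included because none is mentioned in the text, so the simulated point oscillates around the origin.

```python
from dataclasses import dataclass


@dataclass
class DynamicState:
    x: float = 0.0   # emotional valence
    y: float = 0.0   # valence of mood
    vx: float = 0.0  # velocity along the emotion axis
    vy: float = 0.0  # velocity along the mood axis


def clamp(value, lo=-1.0, hi=1.0):
    """Keep a valence inside the interval [-1, 1]."""
    return max(lo, min(hi, value))


def step(s, dx=0.2, dy=0.05, m=1.0, a=0.1, dt=1.0):
    """Advance the point of reference by one simulation step.

    dx, dy: spring constants; m: inertial mass; a: influence of
    emotion on mood. All values are illustrative assumptions.
    """
    # Reset forces proportional to the displacement from the origin.
    fx = -dx * s.x
    fy = -dy * s.y
    # Update velocities first, then positions (semi-implicit Euler).
    vx = s.vx + (fx / m) * dt
    vy = s.vy + (fy / m) * dt
    x = clamp(s.x + vx * dt)
    # The emotional valence additionally shifts the mood by a * x.
    y = clamp(s.y + vy * dt + a * s.x)
    return DynamicState(x, y, vx, vy)
```

Starting from a positive emotional valence and a neutral mood, a single step pulls the emotion back towards zero while the mood drifts slightly upwards.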



3.2 The Concept of Boredom 

In addition to the emotion dynamics described above, a concept of boredom is added 
to the dynamic component as a third, orthogonal z-axis. Assuming that the absence of 
stimuli is responsible for the emergence of boredom (as proposed by [11]), the degree 
of boredom starts to increase linearly over time if the point of reference lies within an 
epsilon neighborhood of absolute zero (as given by εx and εy, see Fig. 3). Outside of 
this neighborhood the value of boredom is reset to zero per default. The co-domain of 
the boredom parameter is given by the interval [-1, 0], so the agent is most bored if 
the value of negative one is reached. 



[Figure panel: 1. Dynamic component, with boredom on the z-axis] 

Fig. 3. The epsilon neighborhood 



The linear increase of boredom can be described by the equation $z(t+1) = z(t) - b$, 
where the parameter b is again a personality-related aspect of the emotion system. 
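The boredom update can be sketched in a few lines. The rectangular shape of the epsilon neighborhood, the inclusive boundary, and all numeric defaults are assumptions made for illustration; the function name is hypothetical.

```python
def update_boredom(z, x, y, b=0.05, eps_x=0.1, eps_y=0.1):
    """Return the boredom value for the next simulation step.

    z lies in [-1, 0], with -1 meaning most bored. eps_x and eps_y
    bound the epsilon neighborhood around the origin (assumed values).
    """
    inside = abs(x) <= eps_x and abs(y) <= eps_y
    if inside:
        # Linear increase of boredom: z(t+1) = z(t) - b, floored at -1.
        return max(-1.0, z - b)
    # Outside the neighborhood boredom is reset to zero.
    return 0.0
```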
3.3 Mapping into PAD-Space for Categorization 

The dynamic component provides the following triple at any time step t: 



$$D(t) = (x_t, y_t, z_t), \quad \text{with } x_t \in [-1, 1],\ y_t \in [-1, 1],\ z_t \in [-1, 0]$$ 



Equation 2. The output triple of the dynamic component 



The variable x_t denotes the emotional state, the variable y_t stands for the actual 
valence of the mood, and z_t represents the degree of boredom. Given this triple, the 
mapping into PAD space for categorization is implemented according to the function 
K(x_t, y_t, z_t, t) as shown in Eq. 3. This mapping results in a triple consisting of the 
functions p(x_t, y_t) for the calculation of the pleasure value, a(x_t, z_t) for the arousal 
value and d(t) for the dominance value. 



$$K(x_t, y_t, z_t, t) = \bigl(p(x_t, y_t),\ a(x_t, z_t),\ d(t)\bigr),$$ 

$$\text{with } p(x_t, y_t) = \tfrac{1}{2} \cdot (x_t + y_t) \text{ and } a(x_t, z_t) = |x_t| + z_t$$ 

Equation 3. Mapping from dynamic component into categorization component 



Pleasure is assumed to be the overall valence information in PAD space and 
therefore calculated as the standardized sum of both the actual emotional valence as 
represented by x_t and the valence of the actual mood as given by y_t. That way, the 
agent will feel a maximum of joy when his emotion as well as his mood is most 
positive and a maximum of reluctance in the contrary case. 

The agent’s arousal ranges from sleepiness to a maximum of mental awareness and 
physiological exertion. As it is assumed that any kind of emotion is characterized by 
high arousal, in contrast to moods, only the absolute value of emotional valence is 
considered in the function a(x_t, z_t). The addition of the (negatively signed) value of 
boredom reflects its relation to the mental state of inactivity. 

The independent parameter of dominance (or, in the other extreme, submissive- 
ness) cannot be derived from the dynamic component. In human terms, this parameter 
describes the agent’s feelings of control and influence over situations and events 
versus feelings of being controlled and influenced by external circumstances [10]. By 
introducing this parameter it is possible to distinguish between angriness and fear as 
well as between sadness and annoyance. Angriness and annoyance come along with 
the feeling of control over the situation whereas fear and sadness are characterized by 
a feeling of being controlled by external circumstances ([9] gives a theoretical 
foundation). Therefore, it is in principle not possible to derive such information from 
the dynamic component. The BDI interpreter of the cognitive architecture of Max, 
however, is capable of controlling the state of dominance in an adequate way. 
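The mapping of Eq. 3 is simple enough to write down directly. In the sketch below the function name `to_pad` is hypothetical, and dominance is passed in as an argument to reflect that it is supplied from outside the dynamic component.

```python
def to_pad(x, y, z, dominance):
    """Map the dynamic triple (x_t, y_t, z_t) to a PAD triple.

    x: emotional valence in [-1, 1]
    y: valence of mood in [-1, 1]
    z: boredom in [-1, 0]
    dominance: provided externally (e.g. by the BDI interpreter)
    """
    pleasure = 0.5 * (x + y)   # standardized sum of both valences
    arousal = abs(x) + z       # emotion raises, boredom lowers arousal
    return (pleasure, arousal, dominance)
```

With a maximally positive emotion and mood and no boredom the result is maximal pleasure and arousal, whereas a neutral but fully bored agent ends up at an arousal of -1.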
Fig. 4. The emotion categories in PAD space 

Several emotion categories have been anchored in PAD space by defining adequate 
PAD triples (see Fig. 4). Some categories exist twice because it is assumed 
unnecessary to distinguish between a dominant and a submissive case for these 
emotion categories. The aforementioned point of reference is presented in Fig. 4 as 
the grey sphere currently linked to the emotion category “angry” (indicated by the 
dotted line to give a visual feedback of the activated emotion category). 

[Figure panel: Categorization (PAD space)] 

Fig. 5. Thresholds Φ and Δ for each emotion category 

An emotion category is activated if the point of reference gets closer than Φ 
units to that particular category and if there is no other emotion category within a 
smaller distance (see Fig. 5). Until the distance d gets below Δ units, the weight w of the 
activated emotion category is calculated by the following equation: 

$$w = 1 - \frac{d - \Delta}{\Phi - \Delta}$$ 

Equation 4. Calculation of weight w for each emotion category 

The weight w is set to 1 if the distance gets below Δ. In Eq. 4, Φ can be 
interpreted as the activation threshold and Δ as the saturation threshold, which are 
both global constants of the emotion system and valid for every emotion category. If 
no emotion category fulfills the above conditions for activation, the cognitive state of 
“confusion” is activated in the emotion system. This may occur in the case of a 
strongly positive valence of moods together with a highly negative emotional valence, 
or vice versa. 
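The activation rule can be illustrated with a small sketch. It assumes Euclidean distance in PAD space and a weight that rises linearly from 0 at the activation threshold to 1 at the saturation threshold; the threshold values, the anchor coordinates in the usage note, and the function name are hypothetical, and returning no weight for “confusion” is likewise an assumption.

```python
import math


def categorize(point, anchors, phi=0.6, delta=0.2):
    """Return (category, weight) for a point of reference in PAD space.

    point:   (pleasure, arousal, dominance)
    anchors: mapping from category name to its PAD triple
    phi:     activation threshold (assumed value)
    delta:   saturation threshold (assumed value)
    """
    # Only the nearest category can be activated.
    name, d = min(
        ((n, math.dist(point, p)) for n, p in anchors.items()),
        key=lambda item: item[1],
    )
    if d >= phi:
        # No category is close enough to be activated.
        return ("confusion", None)
    if d < delta:
        return (name, 1.0)
    # Linear weight between the two thresholds.
    return (name, 1.0 - (d - delta) / (phi - delta))
```

Given two invented anchors such as `{"angry": (-0.8, 0.8, 1.0), "happy": (0.8, 0.8, 1.0)}`, a point lying exactly on the “angry” anchor is assigned that category with full weight, and the weight falls off as the point moves away from it.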

4 Integration and Application 

4.1 Integration in the Agent Architecture 

Max is employed as a presentation agent in the Heinz-Nixdorf MuseumsForum (HNF, 
Paderborn, Germany). In this environment, the agent’s task is to conduct multimodal 
small-talk dialogs with visitors as well as to give explanations about the exhibition he 
is part of. Max is provided with typed user input as well as camera input and responds 
using synthetic speech, gesture, and facial expression. The agent may also leave the 
presentation screen when very annoyed, and reenter the screen after “calming down”. 

The system’s overall architecture (shown in Fig. 6) is similar to those commonly 
applied in embodied conversational agents (e.g. FXPAL [3]). It exhibits a two-level 
structure of concurrent reactive and deliberative processing, the latter being 
responsible for the agent’s conversational capabilities. The emotion system has been 
added to this architecture as a separate module that incessantly receives input from 
and sends data to several other components as indicated by the solid arrows in Fig. 6. 




Fig. 6. Integration of the emotion system in the agent’s architecture 
4.2 The Input and Output Data of the Emotion System 

The components of the cognitive architecture of Max essentially feed the emotion 
system with emotional valence information. These positive or negative valences 
either originate from deliberative processes (interpretation and dialog manager) or 
arise as direct reactions to a positive or negative stimulus (perception). For example, the 
interpretation module sends a positive valence value if the user’s utterance has been 
understood as a compliment. Likewise, the achievement of a desired discourse goal, 
e.g., coming to know the user’s age after having asked for it, causes the dialog 
manager to send positive input to the emotion system. At the reactive level, the 
perception module delivers a positive impulse each time a user has been visually 
perceived*. Finally, a significant boredom value in the emotion system, which only 
arises in cases of no interaction with visitors, leads to the random execution of 
“displacement activities” (e.g., yawning, stretching, looking around) in the behavior 
generation component. The execution of such behaviors also reduces the boredom 
value in the emotion system. 

The emotion system in turn supplies the cognitive architecture of Max with the 
following data: 

1. the mood valence and the degree of boredom of the dynamic component 

2. the corresponding PAD triple 

3. the emotion category and its intensity if one is activated or “confused” 
otherwise 

The first two kinds of information are non-cognitive information types. They are 
used in the behavior generation module to trigger secondary actions and to modulate 
involuntary facets of Max’s observable behavior, namely, the rate of his simulated 
breathing, the frequency of eye blink, and the pitch as well as the rate of his speech. 

The third kind of information is mainly used within the dialog manager at the cognitive 
level of Max’s architecture (see [1] for details). In general, deliberative reasoning 
is realized by a BDI interpreter that operates on the agent’s beliefs, on desires 
representing persistent goals, and on a library of plans - each having preconditions, context 
conditions, an effect and a utility function - to formulate intentions. The interpreter 
continually pursues the applicable plan with the highest utility value as an intention. 

The categorical output of the emotion system is continuously asserted as a belief of the 
agent. That way, the agent’s plan selection is influenced by his current affective state, 
which he can also verbalize. In addition, the emotion is used as a precondition and 
context condition of plans to choose among alternative actions or even to trigger 
actions when becoming “aware” of a certain emotion (by asserting a corresponding 
belief). Finally, based on the mappings proposed by Ekman [5], the current emotion 
category along with its intensity value is directly reflected in Max’s facial 
expressions. This facial expression is then superposed on possible conversational 
behaviors like smiling. 
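How an asserted emotion belief can steer plan selection may be sketched as follows. This is an illustrative reduction, not the BDI interpreter used for Max: the `Plan` class, the belief dictionary, the plan names, and the utility values are all hypothetical, and context conditions and effects are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Plan:
    name: str
    precondition: Callable[[dict], bool]  # tested against the agent's beliefs
    utility: Callable[[dict], float]


def select_intention(plans, beliefs) -> Optional[Plan]:
    """Pick the applicable plan with the highest utility, if there is one."""
    applicable = [p for p in plans if p.precondition(beliefs)]
    return max(applicable, key=lambda p: p.utility(beliefs), default=None)


# Hypothetical plan library; the emotion category is just another belief.
PLANS = [
    Plan("small_talk", lambda b: True, lambda b: 1.0),
    Plan("announce_anger", lambda b: b.get("emotion") == "angry", lambda b: 2.0),
    Plan("leave_display",
         lambda b: b.get("emotion") == "annoyed" and b.get("mood", 0.0) < 0.0,
         lambda b: 3.0),
]
```

With the beliefs `{"emotion": "annoyed", "mood": -0.5}` the sketch selects the hypothetical `leave_display` plan; without any emotion belief it falls back to `small_talk`.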



* This case shows how non-cognitive emergence of emotions can be modeled. Another example 
- realized in a VR application of Max not described here - is positive valence due to the 
human interlocutor petting Max’s cheeks. 
4.3 Example 

Instead of demonstrating the well-known expression of basic emotions on the agent’s 
face, we show here an example situation in which the current emotional state of Max, 
being engaged in a conversation with a visitor, arises from the previous discourse and 
significantly influences the agent’s behavior. After being offended several times by 
verbal input of the visitor, the accumulation of the respective impulses in the emotion 
system results in increasingly negative emotions that become available to the agent’s 
deliberative processes. When first becoming “angry” the agent says “Now I’m getting 
angry” with a low pitch and rate of his voice as well as an appropriate facial 
expression of anger (see Fig. 7 left). Further negative impulses result in the 
emotional state of “annoyance” together with a bad mood. In effect, a plan is 
triggered which causes the agent to leave the display (Fig. 7 right) and to stay away 
until the emotion system has returned to a balanced mood. The period of absence can 
either be shortened by complimenting Max or extended by insulting him again. 




Fig. 7. Max is getting angry and leaves the display 



5 Conclusion 

We presented an emotion system that has been integrated in a cognitive architecture 
of a multimodal conversational agent called Max. The underlying theories of 
emotions and moods, extended by a concept of boredom, are successfully combined 
to achieve a coherent long-term behavior as well as convincing spontaneous reactions. 
The additivity of the emotion dynamics and the bidirectional connection between 
emotions and moods support the traceability of the expressed emotions. By 
modulating the involuntary facets of Max’s observable behavior, controlling his facial 
expressions and influencing the deliberation process, Max is able to express his 
emotional state by a wide range of different communication channels. 

Furthermore, the emotion system was designed to be open enough to be applied in any 
agent framework. It provides both fluent non-categorical data and weighted discrete 
emotion categories to be used by other technical components in order to meet the 
diverse requirements of each component. 
In a first small-talk scenario the visual and auditory effects of the emotion system 
were experienced as beneficial for the believability of Max as a humanoid 
conversational agent. Especially the direct feedback by the facial expressions of Max 
was judged to be adequate and persuasive by a large number of visitors. Together 
with the described modulating effects of the emotion system a harmonious overall 
appearance was achieved at this stage. 

There are still some open questions to be addressed in future work. Mainly the 
control of the dominance dimension by the surrounding components has to be 
conceptualized and implemented in further detail to better fulfill the requirements of 
the underlying psychological theories. In contrast to the cognitive models of 
emotions, higher-level emotion categories (e.g., shame) are not representable in our 
approach. However, with the experiences gained in our work it seems a promising 
challenge to investigate whether purely cognitive emotion models (like OCC) can be 
extended concerning the course of emotions and their mutual interactions. 
Abstract. This paper presents the design of an over-the-phone, non-task- 
oriented, conversational spoken dialogue system character created to chat with 
users about himself and entertain them. Focus is on techniques used in persona- 
lity modelling of the system character, how his mood changes based on input, 
and how this is reflected in his output. Experience with a first version of the 
system and planned improvements are also discussed. 



1 Introduction and Background 

During the construction of a telephone-based, task-oriented frequently asked quest- 
ions (FAQ) system concerning holiday allowance [7], we discussed the issue of how 
to recognise user dissatisfaction from what users say and possible reactions from the 
system to such input. In this context we felt a need to experiment freely with the 
limits and possibilities of the technology available to us which was a standard Speech- 
Mania platform with a Danish language package. 

The platform comes with a speech recogniser, a parser, support for recording 
output phrases, a Speech API for plugging in a speech synthesiser, and the grammar 
and dialogue description language HDDL. An application is developed by writing a 
dialogue model and a set of grammars in this language. The dialogue model and the 
grammar determine the application domain. To model emotions we could exploit 
features like (phonetic) lexicon, grammar, phrasing, recognition scores and dialogue 
flow, but the platform does not support exploitation of e.g. glottal stop, stress, 
prosody or non-linguistic vocal phenomena like laughter or mood. 

A result of the experiments was a non-task-oriented conversational spoken dia- 
logue system (SDS) whose character, soon named Crazy-Kurt, is childish, affective, 
self-centred, and defensive with an underlying uncertainty. His mood and temper are 
likely to vary considerably during a dialogue. The main goal of the Crazy-Kurt 
system is to entertain users. The user may call Kurt and chat with him in Danish, 
primarily about his preferences of food, notably fruit and vegetables. Kurt also knows 
a few other things about himself, e.g. his name, his age, and where he works. 

Crazy-Kurt was running on a demo-line at PDC for quite some time. He was called 
by colleagues, children of colleagues, and sometimes by customers during dinner par- 
ties. To customers Kurt was described as follows: 
"You may call Kurt and get acquainted with him. Depending on your input he 
changes his attitude and becomes cross, embarrassed, or happy. If he perceives 
your input as negative, he becomes cross, if you praise him he becomes 
happy, and if you ask embarrassing questions he becomes embarrassed. The 
system uses a limited vocabulary (350 words, but it may without problems be 
increased to a couple of thousand words) and a mixture of speech synthesis 
and pre-recorded speech. The speech synthesis is used to tell the caller what 
the system recognised, which can lead to entertaining situations." 

We had much fun working with this system. Kurt was inspired by users’ sometimes 
quite rude input and weird behaviour. Contrary to typical task-oriented systems Kurt 
is allowed to behave in an impatient, selfish and impolite way, i.e. he has the same 
rights as users to express his emotions and opinions about the interlocutor. 

A FAQ system is in many ways very different from the Crazy-Kurt system. For 
instance, it must be polite and patient to users and appear serious and professional. 
Nevertheless, the experience we got from experimenting with the Kurt system both 
concerning technology and design created useful input to the FAQ system, e.g. 
regarding adequate reactions to the emotions users express. 

There is still much work which could be done on the Crazy-Kurt system and many 
possible improvements. However, we find it worthwhile to report on our work now 
since we believe that many of the ideas behind Crazy-Kurt are of broader interest and 
have so far received little attention in the area of SDSs. 

We describe the design of and first experience with Crazy-Kurt. In particular we 
discuss techniques used in the modelling of his personality, including how input 
affects his mood and how his reactions are reflected in his output (Section 2). We 
discuss lessons learned from experiments with the system (Section 3). Section 4 brief- 
ly presents related work. Section 5 concludes the paper and discusses future work. 



2 Personality Modelling 

With Crazy-Kurt we wanted to create an entertaining, chat-like (non-task oriented) 
conversational character. The entertainment value was supposed to come from the 
dialogue and the way it evolves, and we wanted this value to be apparent from early 
on in conversations with Kurt. With conversational dialogue follows an increased risk 
for low recognition scores. However, both entertainment value and low recognition 
scores may be handled via childish and affective characteristics as follows: 

• Self-centred: He only wants to talk about himself and his taste for fruit and 
vegetables. This has the advantage that it helps delimit the domain. 

• Affective and impatient: His mood is quite easily affected, which makes it less 
visible that the system does not always understand what the user says. 

• Direct: Kurt may be rude and tactless in the way teenagers often find entertaining. 

• Embarrassed: Some topics are perceived as embarrassing by Kurt. It is irrational, 
but he reacts by becoming annoyed and by changing topic in such situations. 

• Defensive and evasive: Towards the interlocutor Kurt behaves in a confident and 
self-assured way. However, internally he is very uncertain. He will never admit 
mistakes. If Kurt does not understand the input, he will jokingly change topic. 
The characteristics listed above are taken into account in the definition of dialogue 
structure and output. The personality modelling uses the following means: 

1. The dialogue flow is based on dialogue acts, input parameters, and affective state. 

2. Affective state is modelled via satisfaction. Many other emotion parameters exist, 
e.g. hunger, anger, happiness, tiredness, and patience, but we wanted to start the 
experiments with one main parameter. 

3. Self-confidence, or just confidence, relates to recognition score values and user 
confirmations. High scores and user acceptance imply high confidence. 

4. Input parameters: The lexicon, more precisely nouns and adjectives, is emotionally 
modelled in terms of face value, preference, and embarrassment. The lexicon 
(and grammar) also determines what Kurt can understand. 

5. Satisfaction changes are computed from confidence and the input parameters. They 
are a key parameter in exposing affect and mood changes. 

6. Output phrases are designed in accordance with the personality, including rude- 
ness, self-centredness and jokes. Phrases are varied according to the satisfaction 
and satisfaction changes. Feedback quotes user input directly using synthetic 
speech which also may be entertaining. 

Each of these modelling means is explained in more detail in the following sections. 



2.1 Dialogue Flow Architecture 

Kurt’s conversation is structured in terms of a simple state-model, cf. Figure 1. 
Depending on the contents of the user’s input and the current satisfaction state, the 
Kurt system will select a path through the dialogue model and output utterances 
expressing the dialogue acts noted in each of the output boxes it passes on its way, see 
also Section 2.6. 



2.2 Satisfaction 

The central part of Crazy-Kurt’s personality model is his satisfaction level which 
varies on a scale between -5 (angry) and 5 (exalted). Regarding Kurt’s reactions the 
scale is divided into discrete intervals. At both ends the emotions may overflow. At 
the negative end Kurt will hang up while at the positive end Kurt will get the 
impression that the user is making a fool of him and will suddenly change to an angry 
mood. This is done to simulate a personality that is affective and seemingly self- 
confident but internally uncertain. 

The initial state is a neutral satisfaction level where Kurt is neither particularly 
angry nor happy. 



2.3 Self-Confidence 

Kurt’s self-confidence affects his reaction patterns and the magnitude of his satis- 
faction level changes. His self-confidence expresses how confident he is in having 
interpreted the user’s input correctly. 
Fig. 1. Flow of the dialogue. The test “explicit” checks if the previous system turn ended with 
explicit feedback. Legend: output, action, input, condition, transition 
flow (with test). 



Confidence maps into the recognition score scale (Figure 2) and determines the type 
of feedback: For each input turn the recogniser provides a recognition score between 
0 and 1. A low recognition score will imply explicit feedback to what the user said, 
and a low confidence will result in explicit feedback to a larger interval of low 
recognition scores. 

score      0 ---------- min ---------- 1 - confidence ---------- max ---------- 1 
feedback      explicit       explicit                   implicit        none 

Fig. 2. The feedback type is a function of confidence and recognition score. 
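The mapping in Figure 2 can be read as a small decision function. The sketch below is one interpretation under stated assumptions: the values of `min_score` and `max_score` are hypothetical, clamping `1 - confidence` into the interval between them is an assumption, and so is the handling of scores that lie exactly on a boundary.

```python
def feedback_type(score, confidence, min_score=0.2, max_score=0.8):
    """Choose the feedback type for one input turn.

    score and confidence are expected in [0, 1]; min_score and max_score
    are hypothetical constants standing in for "min" and "max" in Figure 2.
    """
    # Boundary between explicit and implicit feedback, kept inside [min, max].
    boundary = min(max(1.0 - confidence, min_score), max_score)
    if score <= boundary:
        return "explicit"
    if score < max_score:
        return "implicit"
    return "none"
```

With a confidence of 0.5 a recognition score of 0.6 yields implicit feedback, whereas with a confidence of 0.3 the same score falls into the enlarged explicit interval.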
The level of self-confidence is modified depending on the smoothness of the dialogue. 
In the current implementation only a fairly simple version of smoothness is realised, 
modelled via user reactions to explicit and implicit feedback. If the user agrees, Kurt’s 
confidence grows, while disagreement makes his confidence decrease. Reactions to 
explicit feedback have a larger effect than reactions to implicit feedback. The present 
model does not include other smoothness features in terms of topics and the flow of 
dialogue such as re-phrasings, unexpected input acts, or loops. 
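A confidence update of this kind could look as follows. The step sizes and the clamping to the interval from 0 to 1 are assumptions chosen for illustration; the text only states the direction of the change and that explicit feedback weighs more than implicit feedback.

```python
def update_confidence(confidence, user_agrees, feedback,
                      explicit_step=0.2, implicit_step=0.1):
    """Raise or lower self-confidence after the user reacts to feedback.

    feedback is "explicit" or "implicit"; both step sizes are hypothetical.
    """
    step = explicit_step if feedback == "explicit" else implicit_step
    if not user_agrees:
        step = -step
    # Keep the value on the same 0..1 scale as the recognition score.
    return min(1.0, max(0.0, confidence + step))
```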



2.4 Input Parameters 

Kurt’s degree of satisfaction continuously changes not only depending on self- 
confidence but also depending on the current input topic and on the course of the 
dialogue both of which influence the input parameters used. 

Every word in the system’s lexicon is tagged with a face value [-5; 5], preference 
[-5; 5], and embarrassment (boolean). Below we explain how face value, preference 
and embarrassment are calculated for an entire user turn. User input is assumed to be 
a concatenation of word sequences covering one or more concepts. Values are calcu- 
lated separately for each concept. See Table 4 for an example of a dialogue annotated 
with input parameter values. 

In addition we discuss the implications of negations in input which affect both face 
value and preference. 

2.4.1 Face Value 

Kurt is very sensitive to losing face. Certain kinds of interaction from the user have a 
negative face value, e.g. corrections and insults. Other kinds of interaction have a po- 
sitive face value, e.g. praise will boost his self-confidence. The full analysis is com- 
plex, and we have used a simplification by letting face value be a function of the input 
word face values: If any input word face value is negative, take the minimum. 
Otherwise, take the maximum. 

2.4.2 Preference 

Certain words/topics have been defined to denote something that Kurt does not like. 
Other words/topics are neutral, while others again denote something Kurt really likes. 

For input with a neutral face value, preference is independent of the face value. For 
input with other face values there will be a knock-on effect so that if the face value is 
positive, input denotes something Kurt likes, and vice versa for negative face values. 

Preference cannot be calculated as a function of single words. For instance, a 
preference for red hair is not a function of the preference for red and the preference 
for hair. All verbs, adjectives and nouns have a preference attached, but combinations 
need a separate assignment of preference. However, we have used the same simplifi- 
cation as for face values and compute preference as a function of the input word 
preferences: If negative preferences exist, take the minimum, otherwise take the maximum. 

2.4.3 Embarrassment 

Certain words/topics have been defined as embarrassing to Kurt. All other 
words/topics are neutral in this respect. In case of embarrassing input Kurt will try to 
talk about something else. 

If any input word is marked as embarrassing, the entire input is perceived as 
embarrassing by Kurt. 
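The three aggregation rules for one concept can be sketched together. The lexicon entries and their tag values below are invented for illustration, words missing from the lexicon are simply skipped, and the knock-on effect of face value on preference as well as negation are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordTags:
    face_value: int      # in [-5, 5]
    preference: int      # in [-5, 5]
    embarrassing: bool


# Hypothetical lexicon entries; the real tag values are not given in the text.
LEXICON = {
    "stupid": WordTags(-4, -3, False),
    "clever": WordTags(3, 2, False),
    "fruit": WordTags(0, 4, False),
    "diary": WordTags(0, 0, True),
}


def min_if_any_negative(values):
    """Minimum if any value is negative, otherwise maximum (0 if empty)."""
    values = list(values)
    if not values:
        return 0
    return min(values) if any(v < 0 for v in values) else max(values)


def concept_parameters(words, lexicon=LEXICON):
    """Return (face value, preference, embarrassment) for one concept."""
    tags = [lexicon[w] for w in words if w in lexicon]
    face = min_if_any_negative(t.face_value for t in tags)
    preference = min_if_any_negative(t.preference for t in tags)
    embarrassing = any(t.embarrassing for t in tags)
    return face, preference, embarrassing
```

With these invented values, `concept_parameters(["you", "are", "stupid"])` yields a face value of -4 and a preference of -3, with no embarrassment.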

2.4.4 Negation 

Negation changes face value and preference. It makes a considerable difference 
whether the user says “you are stupid” or “you are not stupid”. But it does not affect 
embarrassment since the embarrassing topic is mentioned no matter if negated or not. 

There are two types of negation. One type is syntactic such as constructions with 
“not”, e.g. “you are not clever”. A second type is semantic, realised via the use of 
verbs that are “negative”, e.g. “do you hate oranges”. The implication of a negation 
depends on whether the utterance is a statement or a question, cf. Table 1. 

Table 1. Implications of negation in statements. The effects of questions without a “not” are 
identical to those of the corresponding statements and are omitted. 

User utterance           Concept  Preference  Negative verb  Not  Change

Statements
you love fruit           fruit    positive    no             no   positive
you don’t love fruit     fruit    positive    no             yes  negative
you hate fruit           fruit    positive    yes            no   negative
you don’t hate fruit     fruit    positive    yes            yes  positive
you are stupid           stupid   negative    no             no   negative
you are not stupid       stupid   negative    no             yes  positive
you hate ice             ice      negative    yes            no   positive
you don’t hate ice       ice      negative    yes            yes  negative
you like shoes           shoes    neutral     no             no   neutral
you don’t like shoes     shoes    neutral     no             yes  neutral
you hate shoes           shoes    neutral     yes            no   neutral
you don’t hate shoes     shoes    neutral     yes            yes  neutral

Questions with “not”
don’t you love fruit *)  fruit    positive    no             yes  positive
don’t you hate fruit     fruit    positive    yes            yes  negative
aren’t you stupid        stupid   negative    no             yes  negative
don’t you hate ice       ice      negative    yes            yes  positive
don’t you like shoes     shoes    neutral     no             yes  neutral
don’t you hate shoes     shoes    neutral     yes            yes  neutral

*) Note that in Danish “elsker du ikke frugt” can only be a question, not an imperative. 
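The rows of Table 1 are consistent with a compact rule: a negative verb flips the sign of the concept's preference, a “not” flips it again in statements, and a “not” in a question leaves it unchanged. The sketch below encodes this reading of the table; the function name and the sign encoding (-1, 0, 1) are choices made for illustration, not part of the system description.

```python
def negated_preference(preference, negative_verb, has_not, is_question):
    """Return the sign (-1, 0, 1) of the change implied by one utterance.

    preference is the tagged preference of the concept; only its sign is used.
    """
    sign = (preference > 0) - (preference < 0)
    if negative_verb:
        sign = -sign
    # In questions the "not" carries no negating force (cf. Table 1).
    if has_not and not is_question:
        sign = -sign
    return sign
```

For example, “you don’t hate ice” (negative preference, negative verb, “not”, statement) gives -1, while the question “don’t you hate ice” gives 1.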



2.5 Satisfaction Changes 

The input parameters face value and preference will change the level of satisfaction 
into a positive or negative direction, with an adjustment based on self-confidence. The 
third input parameter, embarrassment, is only used to side-step, cf. Figure 1. The 
calculation of the change to satisfaction per turn is made as follows: 

contribution = (face value + 0.5 * preference) / 1.5 
change = (score > 1 - confidence) 
         ? (1.5 * contribution) 
         : (0.8 * contribution) 

The weights used may of course be adjusted. Those presented here give reasonable 
results in the Crazy-Kurt system. 
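The per-turn calculation translates directly into Python. The first function follows the formula above; the second is a hedged sketch of the overflow behaviour described in Section 2.2, where the event names and the choice to jump to the lowest level after a positive overflow are assumptions.

```python
def satisfaction_change(face_value, preference, score, confidence):
    """Change to the satisfaction level for one user turn."""
    contribution = (face_value + 0.5 * preference) / 1.5
    # A well-recognised turn (relative to confidence) has a stronger effect.
    factor = 1.5 if score > 1.0 - confidence else 0.8
    return factor * contribution


def apply_change(satisfaction, change, low=-5.0, high=5.0):
    """Return (new satisfaction, event); event is None unless the scale overflows."""
    new_level = satisfaction + change
    if new_level < low:
        return low, "hang_up"
    if new_level > high:
        # Assumption: the sudden switch to an angry mood lands on the lowest level.
        return low, "suspicious"
    return new_level, None
```

As a worked example, a face value of -3 with neutral preference gives a contribution of -2.0. With a score of 0.9 and a confidence of 0.5 the change is 1.5 * -2.0 = -3.0, and with a score of 0.3 it is 0.8 * -2.0 = -1.6.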

2.6 Output Phrases 

The output is based on canned phrases. They need to be tailored to the system person- 
ality and to the target user group. For example, a system which targets teenage users 
may use expressions commonly used by teenagers. The generation of an output turn 
includes the following rules: 

• The output is a concatenation of utterances depending on the user’s input and in 
accordance with the flow expressed in Figure 1. See also the dialogue in Table 4. 

• Each utterance includes a dialogue act which is either question, prompt, joke, 
insult, feedback, or acknowledgement. 

• The style level of each utterance depends on the current satisfaction level. 

• For each act type and style level, the concrete phrase is selected randomly from a 
predefined set of phrases (typically three). 

• In the feedback, the user input phrases are quoted using synthetic voice, all other 
output is generated as pre-recorded phrases. 
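The selection of a canned phrase by act type and style level can be sketched as below. The interval boundaries in `style_level` are hypothetical, since the text does not give them, and the phrase table holds only a few prompts taken from the example dialogue in Table 4; their assignment to style levels follows the satisfaction values shown there.

```python
import random

# A few prompts from Table 4, keyed by (dialogue act, style level).
PHRASES = {
    ("prompt", "neutral"): [
        "What else do you want to know about me?",
        "Eh, try again.",
    ],
    ("prompt", "cross"): ["Holluja, what a fool you are."],
}


def style_level(satisfaction):
    """Map the satisfaction scale [-5, 5] to a style level (hypothetical cut-offs)."""
    if satisfaction <= -3.5:
        return "angry"
    if satisfaction <= -1.0:
        return "cross"
    if satisfaction < 1.0:
        return "neutral"
    if satisfaction < 3.5:
        return "happy"
    return "exalted"


def select_phrase(act, satisfaction, rng=random):
    """Pick one phrase at random from the set for this act and style level."""
    return rng.choice(PHRASES[(act, style_level(satisfaction))])
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when testing dialogue flows.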

2.6.1 Change Markers 

The most elaborate phrase generation is that of change markers which are small 
expressions or exclamations that mark a change in the level of satisfaction towards the 
user. These markers vary in accordance with the change in satisfaction (Table 2). See 
also the examples in the dialogue in Table 4. A couple of notes: 

• There is a clear difference between the suspicious markers at (angry level, exalted 
change) and the tolerant markers at (exalted level, angry change). 

• However, the phrasings can only be approximate, and the same phrase may occur 
in several states. 



Table 2. Examples of change markers. ‘-’ means no marker. 

               Change
Current level  Angry              Cross             Neutral  Happy           Exalted
Angry          You moron.         Listen you fool.  -        Hrmf            Why should I believe you?
Cross          Listen, you fool.  Bah! Stupid!      -        Hm, perhaps.    That’s better.
Neutral        Bah! Stupid!       -                 Well     -               Okay!
Happy          Hello there.       Stop it.          -        Fine, stay on.  Great!
Exalted        Now, now.          What now?         -        Great!          You are my best friend!

3 Lessons Learned from Collected Dialogues 

As our focus has been on design experimentation and technology exploitation, no 
controlled experiments have been performed. However, while Crazy-Kurt was run- 
ning on a demo-line at PDC, we collected a corpus of calls and transcribed 86 dia- 
logues distributed over three of the system iterations. Some of the dialogues were per- 
formed by children. Although the system was simple, we learned a number of useful 
things to be exploited in a new version of the system. In particular we learned about 
user strategies when users talk to Kurt and what users perceive as entertaining. 

Users often start conversation in what would seem to be a standard way of getting 
basic information on the interlocutor. This is probably triggered by the system’s initial 
question but also by the fact that users know little or nothing about Kurt when they 
call him. The situation is much like a dinner party where you sit next to a person you 
never met before. The system starts by saying: “Hello it’s me. What do you want to 
know about me?”. In reply most users start by asking who the system is, cf. the 
dialogue in Table 4. There are also initial questions concerning e.g. what he knows 
about, his age, where he is, and how he is. 

Many of the questions and comments provided by users during calls actually con- 
cern the system (“me”). Topics include - in addition to identity and age - where he 
works, where he lives, his childhood, wife, children, health, hair, eye colour, glasses, 
whether he smokes, is homosexual, or is a vegetarian. There are also a number of 
questions and comments including adjectives describing the system, e.g. stupid, 
clever, and handsome. Moreover, there are several questions/comments concerning 
his likes and dislikes regarding e.g. alcohol and food, football, music, work, and sex. 

Kurt is very focused on himself and this is probably part of the reason why many 
questions/comments concern Kurt. However, it is also likely that the user is seeking 
common ground, i.e. something to talk to Kurt about in more detail. This is often 
what one does when sitting next to a new person. There are only few examples in the 
corpus of the user offering information about himself, such as “I play Diablo” or “I 
live in Fredensborg”. 

It seems clear that Kurt should help the user more regarding what are reasonable 
conversation topics, e.g. by suggesting topics himself early on, such as “do you like 
vegetables/peas/aubergines/...” or “I have blue hair. What colour is yours?”. 

A large group of the input utterances relate to the dialogue and to what the system 
says. They may be viewed as a kind of meta-communication. They are often triggered 
by the too small grammar but are a major part of the entertainment. Many children 
were really laughing when they heard what the system recognised, no matter if correct 
or not, cf. Table 3. When misunderstood, users would sometimes try to correct the 
system. In other cases they provided long input just to see what the system recognised 
and get a good laugh. 

The system’s grammar definitely needs to be enlarged, which we also knew in 
advance. The small grammar means that the dialogue too often ends in system mis- 
understandings. Although these may be fun when the system tells what it understood, 
the user may also feel that there are too many of them and that there ought to be at 
least some progress in conversation. It is somewhat difficult to have a coherent 
dialogue with the system for more than a few turns. 
Table 3. An excerpt of dialogue 2002-04-05-0-4-14, translated into English. 

No.     Utterance
U42.1   Yes you are a fool.
S43.1   Bah.
S43.2   Listen to yourself: “You are a fool”.
S43.3   No.
S43.4   You are challenged regarding intelligence, poor you.
U43.1   Okay, twit, [laughter]



Clearly an enhancement of Kurt’s background knowledge about himself is needed as 
well. His limited knowledge is also somewhat prohibitive for a longer coherent dia- 
logue. Moreover, he should be able to tell in a comprehensible way what he can chat 
about - at least when asked and maybe also if there is a problem with the dialogue. 
He may e.g. be equipped with a few interests which he can discuss in some detail. 

At one point we re-recorded the output phrases, now using a 14-year-old teenage boy 
instead of the original 40-year-old male. This triggered new kinds of input, like “Who 
are your parents?”. This emphasizes that in creating a character, all system features 
affect user behaviour and should be designed to fit together. 

In the input quotations the synthetic voice would say the exact words that were 
recognised, e.g. “you said ‘you are stupid’” which resulted in responses like “no, 
you are stupid” instead of the design expectation “yes (I said (you are stupid))”. 



4 Related Work 

Picard [10, p.2] notes that evidence is mounting that “emotional skills ... are essential 
for natural communication with humans” and coins the term affective computing. 
However, there is little work on spoken dialogue for entertainment that includes 
emotion representation. The very thorough book by [2] notes on p. 246 that “very 
little is known about effective design techniques for personified interfaces”, and itself 
only deals with “professional interfaces”, and not at all with affection or emotion. 

Affective computing is concerned with recognising, representing, acting upon, and 
expressing emotion. 

Batliner et al. [3] address recognition of user emotions and list several sources for 
this purpose including prosody, repetitions, reformulations, dialogue act sequences, 
facial expression, and words. They show that prosody alone is a fairly good indicator 
of emotion, but that adding repetition and reformulation cues may reduce the error 
rate by one third. Contemporary commercial recognisers do not support prosodic 
features, but grammatical cues may be used. We have used this in [7], in order to 
apologise or provide more explicit help. 

Representation of emotion is common in the agent literature, e.g. [1, 9]. Davis’ [5] 
emotion engine, which models each emotion by a cellular automaton, is a good example. 
Scheutz and Logan [11] describe three basic architectures for affective systems: The 
reactive architecture where input directly triggers actions; the affective architecture 
where explicitly represented emotions control the state; and the deliberative architec- 
ture where alternative actions are considered before any action is taken. They show by 
experiments that the affective architecture is most effective, unless an advanced 
deliberative architecture is used, and even then the affective approach is better in a 
number of situations. Deliberative architectures are difficult to construct so that they 
sufficiently take into account all situations. The architecture of the Crazy-Kurt system 
would be categorised as affective. 



Table 4. Dialogue 2002-04-08-0-0-0, translated into English. Legend: S system, U user, CL
Change level, P Preference, Sat Satisfaction, 0 neutral, -1 cross.

No.  | Utterance                                                         | CL | P  | Sat | Act type
------------------------------------------------------------------------------------------------------------------
S1   | Hello, it’s me. What do you want to know about me?                |    |    | 0   | Introduction
U1   | What is your name?                                                | 0  | 0  |     |
S2   | Eh, try again.                                                    |    |    | 0   | Prompt
U2   | Who are you?                                                      |    |    |     |
S3.1 | Well                                                              |    |    | 0   | Change marker (neutral)
S3.2 | Do you really mean “Who are you”?                                 |    |    | 0   | Explicit feedback
U3   | Yes.                                                              |    |    |     |
S4.1 | I thought so.                                                     |    |    | 0   | Change marker (neutral)
S4.2 | I’m Kurt the Kurterman, the toughest fruit eater in this country. |    |    | 0   | Answer
S4.3 | What else do you want to know about me?                           |    |    |     | Prompt
U4   | You are stupid.                                                   | 0  | 0  |     |
S5   | Did you say “You are blue”?                                       |    |    | 0   | Explicit feedback
U5   | No, you are stupid.                                               | -1 | 0  |     |
S6.1 | Hn.                                                               |    |    | -1  | Change marker (cross)
S6.3 | You did say “You are stupid” didn’t you?                          |    |    | -1  | Explicit feedback
U6   | Yes.                                                              |    |    |     |
S7.1 | If only you’d find half a snail in your salad.                    |    |    | -1  | Change marker (cross)
S7.2 | No.                                                               |    |    | -1  | Answer
S7.3 | Holluja, what a fool you are.                                     |    |    | -1  | Prompt (passive)
U7   | Bye.                                                              |    |    |     |
S8   | What? Are you leaving? We just had great fun.                     |    |    | -1  | Closing



The output generation we use for expressing emotion is simple (cf. Section 2.6).
Walker et al. [12] describe a more elaborate model based on speech acts as well as
social variables and face. Also, Gricean maxims are commonly used as design principles
for task-oriented dialogue [4]. The Gricean literature repeatedly states that
flouting maxims (i.e. violating them on purpose) can be used to achieve effects like
sarcasm and humour.

Techniques based on recognition confidence scores are often used [2, 6, 8]. However,
the optimal use of score values has so far been investigated in little detail.

Finally, there is a clear relation to Weizenbaum’s Eliza [13] and to the modern web
chat-robots based on written input. However, Kurt is significantly different in at least
two ways: (i) Kurt performs spoken conversation, and (ii) Kurt has a personality
which means that there is a development over time in a conversation with him.

5 Conclusion and Next Steps 

We have described the present version of a non-task-oriented conversational spoken 
dialogue system whose affective character, Crazy-Kurt, is meant to entertain users 
who call him and chat with him about himself, primarily about his food preferences.
In particular we have discussed techniques used in the modelling of Kurt’s
personality and how his changing mood is reflected in his output. 

Despite several shortcomings it is our impression from experiments that users - 
and not least children - generally find the system entertaining. In particular they seem 
to enjoy the system’s repetition of what it understood and to provide some kind of 
insulting input to see how this affects Kurt and his output. 

We are fully aware that the present version needs improvement in several ways. As 
a first step we want to extend the grammar coverage which presently is quite poor and 
results in too many misunderstandings. Kurt even breaks the design rule that a system 
should understand its own language. 

But extending the grammar is not sufficient. We also want to extend Kurt’s know- 
ledge about himself since there are too many questions he cannot answer. An idea 
would be to give him quite detailed knowledge about one or more of his interests. As 
part of this we would like to investigate techniques which may help in achieving that 
users naturally stay within Kurt’s domain of knowledge. 

We consider enabling Kurt to ask questions about the user to obtain a more 
conversational interaction. Moreover, we would like to enable understanding of other 
kinds of user speech acts, such as swearing, to obtain a more varied dialogue and a 
more complete conversational character. 

The modelling of personality is so far fairly simple. We would like to experiment 
with the addition of a number of new parameters, including patience, balance, self- 
esteem, and pessimism/optimism. Which weight to give to each of the personality 
parameters included will depend on the personality one wants to model. For example, 
patience may be modelled via the three input events too long, silence, and barge-in. 
An impatient person is quite likely to interrupt the interlocutor and take the initiative. 
A patient person, on the other hand, tends to listen until the interlocutor has finished 
his turn. Thus an impatient system may have a low limit for too long input. Also an 
impatient system may only accept very short silence from the user before it takes the 
initiative and perhaps directly asks why the user does not answer. The system’s 
degree of patience may decrease in case of misunderstandings and if the user 
interrupts (via barge-in¹) too often since it is annoying to be interrupted many times.
If the system’s patience falls below a certain threshold it may even disable barge-in
and tell the user that now he has to listen to the system for a moment.

¹ Though barge-in is supported by most speech platforms, the event is not explicitly accessible
to the dialogue designer. In Kurt as well as in task-oriented dialogue it is important in
meta-communication to know how much of the phrase and information was actually output. In
HDDL it is possible to circumvent the lack of this accessibility using intricate programming.
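
The patience parameter described above could be sketched roughly as follows. This is only an illustration and not part of the Kurt implementation: the class name `Patience`, the numeric ranges, the penalty sizes and the barge-in threshold are all assumptions chosen to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Patience:
    """Hypothetical patience parameter in [0, 1]; 1 means fully patient."""

    level: float = 0.5
    barge_in_threshold: float = 0.2  # assumed value: below this, barge-in is disabled

    def max_input_seconds(self) -> float:
        # An impatient system has a low limit for "too long" input.
        return 2.0 + 8.0 * self.level

    def max_silence_seconds(self) -> float:
        # An impatient system accepts only a short silence before taking the initiative.
        return 1.0 + 4.0 * self.level

    def on_event(self, event: str) -> None:
        # Misunderstandings and user barge-ins reduce patience (step sizes are assumptions).
        penalties = {"misunderstanding": 0.1, "barge_in": 0.15}
        self.level = max(0.0, self.level - penalties.get(event, 0.0))

    def barge_in_enabled(self) -> bool:
        # Once patience drops below the threshold the user has to listen.
        return self.level >= self.barge_in_threshold
```

With these assumed values, a system starting at `level=0.5` would switch off barge-in after three user interruptions, while events it does not know leave the level unchanged.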

Other ideas we would like to consider include entirely new kinds of interaction 
patterns, such as letting Kurt hand over the phone to a colleague if he becomes too 
bored or realises he is on the verge of being exposed as incompetent. Or we may have
sessions where the system detects that it was called before from the same number or 
by the same person. This is realisable via storing the telephone A-numbers and, on 
certain platforms, via speaker recognition techniques. 

The next version of Kurt will include many of the above extensions. We will test 
how well they work in practice and experiment with the weighting between different 
personality parameters as well as with the entertaining features of the system. 

In summary, today’s speech technology opens up new classes of applications
that exploit affective and conversational dialogue, either in terms of new kinds of
non-task-oriented applications in the “-tainment” area, i.e. in entertainment,
edutainment and infotainment, or, in more subtle disguise, in terms of improvements
regarding dialogue robustness in current task-oriented systems.

Abstract. While most dialogue systems restrict themselves to the adjustment of
the propositional contents, our work concentrates on the generation of stylistic
variations in order to improve the user’s perception of the interaction. To accomplish
this goal, our approach integrates a social theory of politeness with a cognitive 
theory of emotions. We propose a hierarchical selection process for politeness 
behaviors in order to enable the refinement of decisions in case additional context 
information becomes available. 



1 Introduction 

The last decade has seen a general trend in Human Computer Interaction (HCI) to emulate 
certain aspects of human-human communication. Computers are ever less viewed as tools 
and ever more as partners or assistants to whom tasks may be delegated. Empirical studies 
by Nass and colleagues [1] show that users tend to apply social norms to computers. They 
blame computers for mistakes, they try to be polite if the computer asks them for their 
opinion and feel flattered if the computer compliments them on a good performance. 

In order to build dialogue systems that are able to communicate with the user in
a more natural manner, the consideration of social aspects is indispensable. Martinovsky
and Traum [2] demonstrated by means of user dialogues with a training system and a
telephone-based information system that many breakdowns in man-machine communication
could be avoided if the machine were able to recognize the emotional state of
the user and to respond to it more sensitively. Earlier experiments by Prendinger and
colleagues [3] have shown that an empathetic computer agent can indeed contribute to
a more positive perception of the interaction.

There has been an increasing interest in the development of spoken dialogue systems 
that dynamically tailor their conversational behaviors to the individual user and the 
current situation (see [4]). Most approaches focus, however, on the adaptation of the 
content to be conveyed and pay hardly any attention to stylistic variations. Furthermore, 
social and psychological aspects have been largely neglected so far. 

Waibel and colleagues [5] present a first approach to adjusting dialogue behaviors to
the user’s emotional state. For instance, they suggest that more explicit feedback should
be given if the user is frustrated. Nevertheless, their approach relies on a few selection
rules and is not based on a general framework for affective dialogue. Walker and colleagues
[6] examine how social factors, such as status, influence the semantic content, the
syntactic form and the acoustic realization of conversations. They consider the speaker’s
emotional state to parameterize the acoustic realization of speech acts. However, they
do not attempt to manipulate the hearer’s emotional state by the deliberate choice of
dialogue strategies.

In the HCI community, there have been various attempts to create virtual agents
that display emotions via facial expressions, gestures and speech. Most of this work was driven
by the goal to increase the agents’ believability (e.g., see [7,8,9]). There are only a few
agents that try to elicit user emotions in a controlled manner. For instance, the COSMO
agent [10] intentionally expresses emotions with the aim of encouraging the student while
the GRETA agent [11] deliberately decides whether or not to show an emotion. Prendinger
and colleagues [12] apply so-called social filter rules to determine the intensity of the
emotion to be displayed depending on factors such as the personality of the agents and
the social distance between them.

The objective of our work is to endow dialogue systems with emotional intelligence. 
Emotional intelligence includes the ability to recognize the user’s emotional state as 
well as the ability to act on it appropriately. In this paper, we investigate how the user’s 
affective response to a system may be improved by the mitigation of face threats resulting 
from dialogue acts.

2 A Theory of Social Interaction 

According to Brown and Levinson [13], politeness strategies are communicative devices 
for redressing the threats inherent in verbal and nonverbal utterances. Positive politeness 
aims at protecting the individual’s desire to be evaluated positively, for example by
expressing admiration for the addressee. Negative politeness accounts for the individual’s
desire to act free from impositions, for example, by emphasizing that the final decision
is up to the addressee.

Walker and colleagues [6] have shown how the Brown and Levinson approach may
be successfully applied to the implementation of simulated dialogues between conver- 
sational agents. Depending on the expected threat to the user’s basic desires, different 
dialogue styles are realized. Walker and colleagues choose one of the following four 
strategies: 

1. Do the speech act directly

2. Orient the realization of the act to the hearer’s desire for approval (positive politeness)

3. Orient the realization of the act to the hearer’s desire for autonomy (negative politeness)

4. Do the act off record by hinting at facts and/or ensuring that the interpretation is ambiguous.

Each of these main strategies has a number of substrategies that may be realized 
by different linguistic means. According to Brown and Levinson, strategies come with
constant thresholds. Consequently, the main and substrategies can be organized into a 
hierarchy according to their increasing thresholds. 

Fig. 1. A section of the hierarchy of strategies along with the thresholds.

We attempt to remedy a shortcoming of the approach by Walker and colleagues.
They work with purely hypothetical values from 0 to 150 without accounting for the
origin of these values. We will start with a linear distribution of thresholds and check
this assumption later against empirical evidence to come to a more realistic distribution.
Moreover, our selection process is more fine-grained since we include the relevant
substrategies in the selection process. Figure 1 gives an impression of the resulting hierarchy
if a linear distribution of thresholds over the given strategies is assumed. For example,
three approval-oriented substrategies exist:

2.1. Claiming common ground,

2.2. Conveying that speaker (S) and hearer (H) are cooperators, and 

2.3. Fulfilling H’s want. 

Each of these strategies can in turn be realized in different ways. Strategy 2.1, for 
example, may result in conveying that some desire of the speaker is also very interesting 
to the addressee or in claiming in-group membership with the addressee. To perform 
a speech act, the strategy with the highest threshold that does not exceed the threat is 
chosen. Consequently, the more serious a face-threat is the higher the threshold of the 
strategy should be. For instance, Strategy 2.1 would be chosen if the face threat was
not too serious. Conversely, using a higher-ranked strategy implies to the hearer that the
threat is more serious because the interactors share knowledge about the organization of 
strategies. 
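
The selection rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the threat is taken to lie in [0, 1], the thresholds are spread evenly over the four main strategies (the actual values of Fig. 1 are not reproduced here), and the names `MAIN_STRATEGIES`, `linear_thresholds` and `select_strategy` are hypothetical.

```python
from bisect import bisect_right

# Main strategies in order of increasing threshold.
MAIN_STRATEGIES = ["direct", "approval-oriented", "autonomy-oriented", "off-record"]


def linear_thresholds(n, low=0.0, high=1.0):
    """Spread n thresholds evenly over [low, high); an assumed linear distribution."""
    step = (high - low) / n
    return [low + i * step for i in range(n)]


def select_strategy(threat, strategies=MAIN_STRATEGIES):
    """Return the strategy with the highest threshold that does not exceed the threat."""
    thresholds = linear_thresholds(len(strategies))
    # bisect_right counts the thresholds <= threat; the last of them is chosen.
    index = max(bisect_right(thresholds, threat) - 1, 0)
    return strategies[index]
```

Under these assumptions the thresholds are 0, 0.25, 0.5 and 0.75, so a mild threat of 0.1 is handled directly while a threat of 0.6 leads to an autonomy-oriented strategy. The same function could be applied to the substrategies within one band.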

The threat resulting from an utterance depends on the social distance between the 
speaker and the hearer D(S,H), the power that the hearer has over the speaker P(H,S) 
and, finally, on a ranking of imposition for the dialogue act under discussion R_α (α
denotes the dialogue act). For instance, a command is ranked as a higher imposition than
an offer. 

In the setting of a computational system it is debatable whether D and P should be
regarded as variables at all. Power definitely resides with the user, who can simply switch
the system off, and attributing aspects like social distance to a technical system seems a
bit far-fetched. But this perspective is only viable if users do not regard computational
systems as interaction partners. Reeves and Nass [1] have shown convincingly
that users do regard them as such and even attribute social behaviours to computers
which they may or may not have in reality. Physical power (switching off) resides with the
user but during the interaction between user and system the power of the roles played by
system and user is the important aspect. In a tutoring scenario, e.g., the greater (social)
power resides with the teacher who guides and challenges the pupil throughout the
learning process. Consequently, in a tutoring system, the user has to accept the greater
power of the system due to the social setting. 

A similar effect can be observed for social distance. A user being confronted with 
a system for the first time is not yet familiar with its technical features and probably 
unsure how to interact with it. This is analogous to meeting a new colleague at work. If
you talk to him for the first time, you might ask: “Excuse me, would you like to join me
for lunch?” The colloquial “Hey mate how about lunch” reflects a much lower social
distance between the two conversational partners. In both cases, a lower bound for the 
two variables may be defined that is due to the roles of the interaction partners. Although 
teacher and student may come to know each other better over time, which will decrease 
the social distance between them, this will not result in peer-group relations. 



3 Integration of an Affective Component 

Walker and colleagues consider the speaker’s emotional state to parameterize the acoustic
realization of the speech act. However, the emotional state is not derived from
situational or pre-defined personality factors and it has no effect on the choice of strategies
at all. Instead, it is set as a fixed parameter for a given agent and only affects the
acoustic output which becomes more variable in this way. Moreover, they do not attempt to
manipulate the hearer’s emotional state by the deliberate choice of dialogue strategies.

Our work starts from the assumption that the perceived threat resulting from a speech 
act heavily depends on the user’s emotional state. For instance, if the user is already rat- 
her irritated due to communication problems, a proposal by the agent to input a long 
identification number is rather likely to be perceived as an impingement. Furthermore, 
knowledge about the causes for the user’s emotions should guide the selection of poli- 
teness strategies. Consequently, the emotional state is a factor that emerges during the 
interaction and dynamically influences the ongoing dialogue. 

We represent emotions using a dimensional model (see [14]) which characterizes 
emotions by the two orthogonal dimensions valence and arousal (see Fig. 2). Valence 
indicates to which degree an emotion is positive or negative and arousal refers to the 
extent of the emotion. A given emotional state is then characterized by a point in this 
two dimensional space. Emotion dimensions can be seen as a simplified representation 
of the essential properties of emotions. For instance, anger can be described by high 
arousal and negative valence. 

Apart from the ease of describing emotional states that cannot be distributed into 
clear-cut fixed categories, the two dimensions valence and arousal are well suited for 
emotion recognition (see [15]). 

Fig. 2. The dimensional model with some example emotions.

Given this two dimensional model of emotions, how does the user’s emotional state
influence the choice of an appropriate strategy? To calculate the weight of the face
threat Θ, Walker and colleagues follow Brown and Levinson’s proposal which takes
the variables social distance, power, and ranking of the speech act into account: Θ =
D(S,H) + P(H,S) + R_α. Apart from D, P, and R, Brown and Levinson mention
situational factors that may influence the given variables to fit specific contexts and
situations, but leave aside the question of how this influence is reflected. We treat the
emotional state of the user as such a situational factor. Integrating this information into
the weight calculation turns it into a multiplicative factor that scales the other
variables. In our first approach, we assume the emotional state of the user E(H) to
influence all three variables by the same amount according to the following heuristic:

- Positive valence: decrease the weight of threat Θ

• high arousal (e.g., joy): A possible threat is of no great significance to the user
because he is in a very positive state. Thus, the decrease is high, 0 < E(H) < 0.5.

• low arousal (e.g., bliss): As above, the user is in a positive state, so the threat will
not affect him with its full weight, but the decrease is low, 0.5 < E(H) < 1.

- Negative valence: increase the weight of threat Θ

• high arousal (e.g., anger): The user is more sensitive to possible threats. The
weight of the threat is increased by a large amount, 1.5 < E(H) < 2.

• low arousal (e.g., sorrow): The user is still sensitive to threats, but to a lesser
degree than before. Accordingly, the increase is low, 1 < E(H) < 1.5.
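
The heuristic can be written down as a small lookup. The following sketch assumes that valence and arousal are each given on a scale from -1 to 1 with 0 as the midpoint, and that a neutral valence leaves the threat unchanged; both points, like the function name `emotion_factor_range`, are assumptions made for illustration and are not specified in the text.

```python
def emotion_factor_range(valence, arousal):
    """Return the (lower, upper) bounds for E(H) given valence and arousal.

    Assumes valence and arousal lie in [-1, 1], with 0 as the neutral midpoint.
    """
    high_arousal = arousal > 0
    if valence > 0:
        # Positive valence decreases the threat; strongly so for high arousal.
        return (0.0, 0.5) if high_arousal else (0.5, 1.0)
    if valence < 0:
        # Negative valence increases the threat; strongly so for high arousal.
        return (1.5, 2.0) if high_arousal else (1.0, 1.5)
    return (1.0, 1.0)  # neutral valence: threat left unchanged (assumption)
```

How a concrete value of E(H) is picked within each interval is left open here, as it is in the heuristic itself.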

The question arises of whether the emotional state has a different influence on distance,
power, and imposition. For example, anger might influence P(H,S) to a strong
extent while sadness might have a significant impact on the variable R. Leaving these
subtleties aside for the moment, our agent S estimates the threat Θ to the hearer H when
performing the speech act α as:

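\[
\Theta = \begin{cases} 1 & \text{if } \Theta_E > 1 \\ \Theta_E & \text{otherwise} \end{cases}
\]

with \(\Theta_E\) defined as: \( \Theta_E = E(H) \cdot \frac{1}{3}\,\bigl(D(S,H) + P(H,S) + R_\alpha\bigr) \)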
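
As a minimal sketch, the estimate can be computed as below. It reads the formula as the emotional factor E(H) times the mean of D, P and R, capped at 1; the normalising factor of one third is our reading of the formula and is consistent with the values in the dialogue example of Section 4. The function and parameter names are hypothetical.

```python
def face_threat(distance, power, imposition, emotion_factor=1.0):
    """Estimate the face threat as E(H) times the mean of D, P and R, capped at 1."""
    theta_e = emotion_factor * (distance + power + imposition) / 3.0
    # The piecewise definition (1 if theta_e > 1, else theta_e) is a simple cap.
    return min(theta_e, 1.0)
```

For instance, with illustrative values D = 0.2, P = 0.4 and R = 0.6 the neutral threat is 0.4, and an emotional factor of 1.5 raises it to 0.6.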

4 Illustration of the Model by Means of a Dialogue Example 

To illustrate the approach, we start from an example taken from a Spoken Language
Dialogue System (SLDS) developed by Bühler and colleagues [16] for appointment
scheduling. Let us assume that the user has already fixed an appointment in the city of
Heidelberg at a certain time and wishes to arrange another appointment the same day.
Based on its reasoning process, the SLDS generates the following dialogue utterances:

System: You already have an appointment in Heidelberg tomorrow. 

System: Where do you want to go first, Stuttgart or Heidelberg? 

The user reacts by evaluating the first alternative as a hypothetical scenario: 

User: How much time would I have in Stuttgart if I went there first? 

After processing the travelling constraints, the system is able to provide the requested 
information: 

System: You would have at most 30 minutes. 

Now let us discuss how the enhanced dialogue system would interact with the user. 
Let us assume the social distance between system and user is very low and that the 
system has power over the user, i.e. D(S,H) = 0.1, P(H,S) = 0.3. The system decides to 
respond with an inform speech act to which a rather low imposition value is assigned: 
R = 0.3. We assume that the user has no preference for a specific schedule and is just 
exploring different options. Telling the user that there are only 30 minutes left would not
necessarily result in a negative user emotion. Consequently, the face threat won’t be
influenced by the user’s emotional state and the threat resulting from uttering an inform
speech act is rather low: Θ = 0.23. Therefore, the system would go for an effective
solution and perform the act directly by responding:

(1) System: You would have at most 30 minutes. 

Let us now assume that the user is in a rather bad emotional state due to repeated
communication problems, i.e., negative valence and high arousal yield E(H) = 1.6.
Consequently, Θ = 1.6 * 0.23 = 0.37. Instead of a direct strategy, the application of an
approval-oriented strategy is more advisable. According to Figure 1, the system decides
to convey the impression that it and the user are cooperators:

(2) System: I know this is quite irritating. 

System: But, we would just have 30 minutes. 

Let us now start from a setting in which the social distance between system and user 
is high and in which the user has power over the system: D(S,H) = 0.8, P(H,S) = 0.8. 
Furthermore, we assume that the user insists on going to Stuttgart first. 

User: I want to go to Stuttgart first.

In this case, the system would have to reject the user’s request. The imposition
associated with a refusal is ranked quite high: R = 0.8. In addition, the user will be in
a rather bad emotional state since his goals cannot be accomplished: E(H) = 1.8. As a
consequence, the threat resulting from the speech act is maximal: Θ = 1.0. According
to Fig. 1, the system would thus decide to use an off-record strategy. As already noted
by [6], off-record strategies are most difficult to realize computationally. One possibility
is to provide an evasive answer and to distract the user by mentioning the advantages of
another option:

(3) System: If you go to Heidelberg first, you will have enough time for the
meeting in Stuttgart.

Let us now assume that the user is in a rather relaxed and joyful state because he 
has just closed a profitable deal. Consequently, a change in the schedule won’t have a 
great impact on his emotional state: E(H) = 0.7. In this case, the emotional state will 
decrease the threat resulting from the speech act: Θ = 0.8 * 0.7 = 0.56. Therefore, the
system would prefer an autonomy-oriented strategy. 

(4) System: Would it be ok if you had just 30 minutes in Stuttgart? 
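
For reference, the four threat values used in this section can be retraced from the formula of Section 3. The arithmetic below reads the formula as E(H) times the mean of D, P and R, capped at 1, and takes E(H) = 1 in example (1), where the emotional state is said to have no influence; both are our reading, not something the text states explicitly.

```latex
\begin{align*}
(1)\quad \Theta   &= 1   \cdot \tfrac{1}{3}(0.1 + 0.3 + 0.3) \approx 0.23 \\
(2)\quad \Theta   &= 1.6 \cdot \tfrac{1}{3}(0.1 + 0.3 + 0.3) \approx 0.37 \\
(3)\quad \Theta_E &= 1.8 \cdot \tfrac{1}{3}(0.8 + 0.8 + 0.8) = 1.44 > 1
                     \;\Rightarrow\; \Theta = 1 \\
(4)\quad \Theta   &= 0.7 \cdot \tfrac{1}{3}(0.8 + 0.8 + 0.8) = 0.56
\end{align*}
```

If the thresholds of the four main strategies were spread linearly at 0, 0.25, 0.5 and 0.75 (an assumption, since the values of Fig. 1 are not given in the text), these results would fall into the direct, approval-oriented, off-record and autonomy-oriented bands respectively, which matches the choices made in examples (1) to (4).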



5 Consideration of Additional Context Factors 

Even though the theory by Brown and Levinson seems to provide a promising basis for 
the implementation of politeness strategies, it also creates a number of problems (see [17] 
for a more detailed discussion). First of all, the linear ordering of politeness strategies
from direct via approval-oriented and autonomy-oriented to off-record may lead to
inconsistencies. There is no doubt that the perceived politeness of any strategy may
drastically vary depending on a number of context factors, such as the user’s personality,
his or her ability to perform a certain task etc. Indeed, a number of studies revealed that
autonomy-oriented strategies are not always perceived as more polite than approval-oriented
strategies. Furthermore, speakers tend to use several combinations of politeness
strategies within one utterance. Even a dialogue act that is aimed at sustaining negative 
face of the hearer can be employed in an approval-oriented strategy. It is also questionable 
whether indirectness and vagueness are actually useful means to redress face threats. 
For instance, in the calendar scenario discussed above, vague system utterances might 
even increase the user’s negative emotional state. 

As a first step to improve the selection process, we consider not only the user’s
emotional state, i.e. the values for the dimensions valence and arousal, but also rely on a
specification of his or her personality profile and a classification of the events, actions
and objects that caused it. This approach is in line with the OCC cognitive model of
emotions [18] which views emotions as arising from some valenced reaction to events 
and objects in the light of agent goals, standards, and attitudes. 

For example, a student who feels frustrated after repeatedly providing wrong answers
to a tutoring system might interpret an autonomy-oriented strategy, such as “Why not 
try it again?”, as pure irony. Since the event is not desirable to the user, but the user is 
responsible for the failure, the system would rather go for an approval-oriented strategy. 
On the other hand, if the failure is obviously caused by the system, the user might feel 
offended if the system applies an approval-oriented strategy and tries to convey the 
impression that they are collaborators, for instance by uttering: “Don’t worry! Together, 
we will manage this!”. Here, an apology would be more appropriate.

The model that we have elaborated above can be regarded as the default case of emotional
influence on strategy selection. It is employed if not enough knowledge about the
interlocutor or the situation is available. For instance, a tutor might opt for an
approval-oriented strategy because the student is lacking self-confidence. But such a strategy can
still be realized in a number of different ways which mitigate face threats of different
weights. Without any additional knowledge, we would now compute the expected face
threat based on a refinement of the formula presented above.

\[
\Theta_i = \begin{cases} 1 & \text{if } \Theta_{E_i} > 1 \\ \Theta_{E_i} & \text{otherwise} \end{cases}
\]

with \(\Theta_{E_i}\) defined as: \( \Theta_{E_i} = E(H) \cdot \frac{1}{3}\,\bigl(D(S,H) + P(H,S) + R_{\alpha_i}\bigr) \), with i
signifying either positive or negative politeness, i.e., \( i \in \{pos, neg\} \).

Fig. 3. A section of the modified hierarchy of strategies in the extended model along with the
thresholds.

Figure 3 gives a slightly modified version of the original choices. Depending on the 
available knowledge, strategies are either deliberately chosen by the speaker (knowledge- 
driven) or determined by the given thresholds (weight-driven). The thresholds are as- 
signed dynamically in a linear fashion to the appropriate level in the hierarchy of stra- 
tegies. Figure 3 gives an example where the choice between direct, approval-oriented, 
autonomy-oriented, and off-record is decided by the speaker in favor of an approval- 
oriented strategy leaving the application of the thresholds to the levels further down in 
the hierarchy. The dotted arrows indicate that the division between knowledge-driven 
and weight-driven decisions can shift in either direction. 
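
One possible reading of this mixed selection is sketched below. It assumes that a knowledge-driven choice of the main strategy frees the whole threat range [0, 1) for a linear assignment of thresholds to its substrategies, whereas a purely weight-driven selection first splits the range among the main strategies. This interpretation, the names `HIERARCHY`, `band` and `select`, and the placeholder entries for strategies whose substrategies are not listed in the text are all assumptions.

```python
# Substrategies per main strategy; only the approval-oriented ones are named in the text,
# the single-entry lists are placeholders.
HIERARCHY = {
    "direct": ["direct"],
    "approval-oriented": ["common ground", "cooperators", "fulfil H's want"],
    "autonomy-oriented": ["autonomy-oriented"],
    "off-record": ["off-record"],
}


def band(threat, options, low, high):
    """Return (option, band_low, band_high) for linearly spaced thresholds on [low, high)."""
    step = (high - low) / len(options)
    index = min(max(int((threat - low) // step), 0), len(options) - 1)
    return options[index], low + index * step, low + (index + 1) * step


def select(threat, known_main=None):
    """Choose a (main strategy, substrategy) pair for a given threat."""
    if known_main is None:
        # Weight-driven at both levels: [0, 1) is first split among the main strategies.
        main, low, high = band(threat, list(HIERARCHY), 0.0, 1.0)
    else:
        # Knowledge-driven main choice: the full range is reassigned to its substrategies.
        main, low, high = known_main, 0.0, 1.0
    sub, _, _ = band(threat, HIERARCHY[main], low, high)
    return main, sub
```

For example, `select(0.37)` stays weight-driven throughout, while `select(0.1, known_main="approval-oriented")` lets the speaker fix the main strategy and uses the threat only to pick among its substrategies.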

Another attempt to consider a larger variety of context factors for the selection of 
politeness strategies has been proposed by Johnson et al. (see [19], this volume). They 
leave the choice between positive and negative politeness strategies to the deliberate 
decision of the speaker by calculating two FTA-weights that take different aspects of the 
speech acts and the choices of the speaker into account. Their extension of the Brown and 
Levinson model rests on observations of tutor-learner dialogues in a computer mediated 
learning environment. Consequently, the choice of an appropriate strategy in their model 
is driven by the knowledge of the tutor about the situation and about the learner. It is 
based on an analysis of the linguistic means employed by the tutor, leaving situational 
factors like the emotional display of the learner aside. 

A straightforward combination of their approach and ours would be to rely on their
mechanism for choosing between approval- and autonomy-oriented strategies, but guide
the selection of substrategies by the expected face threat to the user.



6 Conclusions and Future Work 

In this paper, we have presented a new dialogue model that integrates a social theory of 
politeness with a cognitive theory of emotions. The objective of our work is to improve 
the user’s subjective perception of the interaction by mitigating face threats that result
from dialogue acts. The acceptance of a system that simulates certain aspects of social 
behaviors certainly depends to a large degree on the relationship between the user and the 
system. A user who views the system rather as a tool might be more interested in getting 
straight answers while a user who regards the system rather as a personal assistant may 
appreciate the system’s attempts to exhibit certain kinds of social behavior. 

To shed light on this question, the next step of our work will be to conduct an
empirical study that compares the user’s response to the affective and the non-affective
version of the dialogue system by measuring and analyzing his or her physiological
feedback. To this end, while interacting with the two different system versions, the user
will be monitored by means of bio sensors capturing skin conductivity, heart rate, respiration
and muscle activity.

We aim at measuring the user’s physiological feedback since self reports bear the 
disadvantage that users might not be able or willing to describe their true sensations. 
Furthermore, there is the danger that users might not remember the experience any 
more when asked after the interaction with the system. The advantage of bio signals as 
indicators of emotions, compared to external means of expression, is that they are hardly 
consciously controllable by the user and thus allow us to largely circumvent the artifact 
of social masking. 

The goal of our studies will be to find correlations between patterns in the 
physiological state - as measured by means of bio sensors - and the threat resulting from 
the dialogue acts produced by the system. We hypothesize that an increased number of 
greater threats over a longer period of time will gradually result in user stress. Such a 
finding would also be in line with earlier experiments by [3] who showed by measuring 
the user’s skin conductivity that users seem to be less stressed if the agent apologizes 
for delays. Even though biofeedback analysis may help to acquire knowledge about the 
user’s emotional state while he or she is interacting with the system, our primary goal is 
to make use of this method to evaluate the affective impact of our politeness strategies 
and employ the resulting knowledge for guiding the selection and definition of politeness 
behaviors. 

Our approach leaves a lot of room for extensions. First of all, face threats are 
inherently multi-modal. Dressing up a threat in a joke usually only works if the speaker 
shows in his whole appearance (facial expression, body posture) that he is telling a joke. 
Otherwise the threat might be perceived as even more severe than it is. To identify 
multi-modal conversational behaviours that control the strength of perceived face threats, 
we are currently preparing a corpus study which will also help us to ground the selection 
of politeness behaviors in empirical data. 



Acknowledgments. We would like to thank Lewis Johnson for fruitful discussions on 
the implementation of politeness behaviors. 

Do You Want to Talk about It? 

A First Step towards Emotion Integrated Dialogue 

Abstract. In this paper, we describe an implemented system for 
emotion-referring dialogue. An agent can engage in emotion-referring 
dialogue if it first has a model of its own emotions, and secondly has 
a way of talking about them. We build this facility on to the virtual 
humans in the MRE project, building upon the existing emotion and 
dialogue facilities of these agents. 



1 Introduction 

It is widely recognized that in order to make agents more human-like, the use 
of emotions is essential. Although some work has been done in giving agents 
emotions and having those emotions influence the agent’s behavior and dialogue, 
few systems offer the agents a chance to actually talk about their emotions, to 
say they are sad for example. We call this emotion-referring dialogue. 

In this paper, we present a first step towards emotion-referring dialogue. 
As a vehicle, we use the Mission Rehearsal Exercise (MRE) immersive learning 
environment. MRE features virtual humans, autonomous agents the user can 
enter into conversation with [5]. The virtual humans possess a deep process 
model of emotion, EMA, that not only derives a current emotional state, but 
represents an “explanation” of how the emotion arose from features of the social 
and physical environment and an agent’s beliefs, desires and intentions [1] [4]. 
Till now, an agent’s emotions could be conveyed non-verbally by gestures, body 
language and facial expressions and also influence both the interpretation and 
generation of dialog. However, the agent could not explicitly communicate its 
emotional state or respond to questions about it. We show how emotions are 
made explicit by implementing a question-answering capability, enabling the 
agent to verbally express his feelings when confronted with questions like “How 
do you feel?” and “Why are you mad, it’s such a lovely day?”. 

2 Emotion-Referring Dialogue 

A first step in designing emotion-referring dialogue is deciding which kinds of 
information could be conveyed. Following appraisal theory [2], emotions in EMA 
arise from an agent’s interpretation of its relationship to the physical and social 
environment. “Cognition” informs this interpretation by inferring (via 
domain-independent reasoning mechanisms) how events impact the agent’s goals, beliefs, 
etc. For example, some consequence of an action could, with some probability, 
inhibit or facilitate a goal, leading to joy or distress. These events and their 
implications are appraised in terms of a number of specific dimensions, including 
desirability, likelihood, blame, sense of control, etc. These appraised dimensions 
then determine an instance of emotion of a specific type (anger, distress, etc.) 
and intensity. This multi-level representation maintains not only the top-level 
emotional response, but its justification in terms of appraisal dimensions and 
lower-level domain propositions and relationships. 

The original dialogue capability in MRE virtual humans was designed to 
support team task interaction, including orders and negotiation over actions, and 
questions and answers about actions and states. For example, (1) is the 
representation of a state proposition (a boy is not healthy). (2) is the representation 
of a question “who is hurt”, which can be answered using knowledge such as (1). 

(1) (C262 ^attribute health-status ^object-id boy ^polarity negative ^time present 
^type state ^value healthy) 

(2) a. (S8290 ^prop P7433 ^q-slot object-id ^type question) 

b. (P7433 ^attribute health-status ^polarity negative ^time present 
^type state ^value healthy) 

As a step towards talking about emotions, we derive emotion state 
representations like (3) from EMA. Using these state representations required some 
extensions to the current-state model. Previously, these state objects were 
present throughout the run-time of the agent, with different belief values (true, 
false, or unknown). While this is feasible for a fairly small closed task, it is 
impractical for representing emotions, since the intensities of different emotions take 
on continuous values, with many changes as time passes and events occur. Thus 
we have created new on-the-fly states which only exist while the question is 
under discussion. These represent the emotion type and intensity, but also the 
justification structure underlying the emotion instance. This allows the current 
dialogue question answering mechanisms [6] to be used as-is, without creating 
an unwieldy number of extra states. 

(3) (C262 ^attribute max-feeling ^object-id <self> ^type state ^value 
<emotion>) 
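
As a rough illustration of how a question representation such as (2) can be answered from state representations such as (1) and (3), the following Python sketch treats them as attribute-value records. The dictionary encoding, the function name answer_question and the filler value "worried" are assumptions made for this illustration; they are not the MRE implementation.

```python
# Hypothetical dictionary encoding of state (1) and of an on-the-fly emotion
# state in the spirit of (3); "worried" is an invented filler for <emotion>.
HEALTH_STATE = {
    "attribute": "health-status", "object-id": "boy", "polarity": "negative",
    "time": "present", "type": "state", "value": "healthy",
}
EMOTION_STATE = {
    "attribute": "max-feeling", "object-id": "self",
    "type": "state", "value": "worried",
}

# Question (2): "who is hurt". The q-slot names the slot being asked for and
# the proposition lists the slots that a matching state must agree on.
WHO_IS_HURT = {
    "type": "question",
    "q-slot": "object-id",
    "prop": {
        "attribute": "health-status", "polarity": "negative",
        "time": "present", "type": "state", "value": "healthy",
    },
}

# An assumed emotion question: "which emotion do you feel most?"
WHICH_EMOTION = {
    "type": "question",
    "q-slot": "value",
    "prop": {"attribute": "max-feeling", "object-id": "self", "type": "state"},
}


def answer_question(question, states):
    """Return the q-slot value of every state that matches the proposition."""
    prop, q_slot = question["prop"], question["q-slot"]
    return [
        state[q_slot]
        for state in states
        if q_slot in state and all(state.get(k) == v for k, v in prop.items())
    ]


states = [HEALTH_STATE, EMOTION_STATE]
print(answer_question(WHO_IS_HURT, states))
print(answer_question(WHICH_EMOTION, states))
```

In this sketch the same matching routine serves both the task question and the emotion question, which is the point of deriving emotion states in the same format as task states.
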

As we are still experimenting with the appropriate character of 
emotion-referring dialogue, and as the existing natural language understanding and 
generation routines do not support such dialogue, we have currently constructed 
special-purpose techniques to handle these capabilities. Rather than using the existing 
semantic parser, which would have required either adding to a finite state 
grammar, or providing training data, or both, a keyword spotting parser was built 
specifically to look for emotion-related questions. For every question a set of both 
present and missing keywords is defined, which uniquely identifies that question. 
The input is scanned for these keywords, and if a sufficient match is found, a 
semantic representation similar to (3) is created. This is then used to match 
against possible states that could be used to answer the question. For now, most 
of the semantics will result in the generation of just one reference. When 
ambiguity about the user’s intention arises, rules select the most appropriate 
state: all possibilities that do not lead to an emotion with a certain threshold 
intensity are discarded, and if multiple high-intensity states exist, one is 
picked randomly. The system then updates the dialogue state, after which 
the natural language generation phase is entered once the dialogue manager 
decides to answer the question. 
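
The keyword spotting and the threshold rule described above can be sketched in Python as follows. This is only a minimal sketch: the question patterns, the exact-match criterion (the text speaks of a "sufficient" match without defining it) and the threshold value 0.5 are all assumptions introduced for illustration.

```python
import random

# Hypothetical question patterns: keywords that must be present and
# keywords that must be missing for the pattern to match.
PATTERNS = {
    "how-feel": {"present": {"how", "feel"}, "missing": {"about"}},
    "how-feel-about": {"present": {"how", "feel", "about"}, "missing": set()},
    "why-feel": {"present": {"why"}, "missing": {"how"}},
}


def spot_question(utterance):
    """Return the first pattern whose keyword constraints hold, else None."""
    words = set(utterance.lower().replace("?", " ").replace(",", " ").split())
    for name, pattern in PATTERNS.items():
        if pattern["present"] <= words and not (pattern["missing"] & words):
            return name
    return None


def select_state(candidates, threshold=0.5, rng=random):
    """Drop states below the intensity threshold; pick one survivor at random."""
    strong = [c for c in candidates if c["intensity"] >= threshold]
    return rng.choice(strong) if strong else None


print(spot_question("How do you feel about the boy?"))
print(select_state([{"emotion": "fear", "intensity": 0.2},
                    {"emotion": "anger", "intensity": 0.9}]))
```

Listing missing keywords alongside present ones is what lets "How do you feel?" and "How do you feel about the boy?" map to different questions even though they share most of their words.
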

In the generation phase, we again implemented a special-purpose routine 
rather than create the knowledge needed to do full linguistic realization. Instead of 
modifying the existing realization system to produce emotion descriptions, we designed 
a template system that takes the values in the emotion state to choose 
specific filler words for slots. The emotion state serves as the 
main source of information, providing, where available, the emotion type and 
intensity, and additional information about the associated state. Special look-up 
tables translate these pieces of information into strings of natural language. 
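
A template system of this kind might look like the following Python sketch. The look-up tables, the intensity bands and the template wording are invented for illustration, since the actual tables are not given in the text.

```python
# Hypothetical look-up tables mapping emotion-state values to words.
EMOTION_WORDS = {"fear": "worried", "anger": "angry", "joy": "happy"}
# (lower bound of intensity band, modifier placed before the emotion word)
INTENSITY_WORDS = [(0.7, "very "), (0.3, ""), (0.0, "a little ")]


def intensity_modifier(intensity):
    """Map a numeric intensity to a modifier string via the look-up table."""
    for lower_bound, word in INTENSITY_WORDS:
        if intensity >= lower_bound:
            return word
    return ""


def realize(emotion_state):
    """Fill a fixed template with the modifier, emotion word and optional object."""
    text = "I'm feeling " + intensity_modifier(emotion_state["intensity"])
    text += EMOTION_WORDS[emotion_state["emotion"]]
    if emotion_state.get("about"):  # optional information on the associated state
        text += " about " + emotion_state["about"]
    return text + "."


print(realize({"emotion": "fear", "intensity": 0.5, "about": "him"}))
```
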



3 Results 

The implemented emotion dialogue system allows the user to ask a variety of 
questions, e.g. which emotion an agent feels most, if he feels a certain emotion, 
how he feels about a certain state, what causes him to feel a certain way, etcetera. 
These types of questions can be asked in various ways. For instance, the question 
“Are you worried?” could also be rephrased as “Do you feel concerned?” or “You 
seem upset.” - all map to the same semantics. They are answered depending on 
the current state of the system. If, for instance, the user wants to know who is 
making the agent feel a certain emotion, the agent could answer that he in fact is 
not feeling that emotion at all, point to a certain person, blame a certain person, 
or state that no one in particular is responsible. Personality also plays a role, 
particularly in the amount of information an agent conveys. For example, if the 
agent feels worried (for instance, because the agent is helping an accident victim 
that may die), the question “Are you worried?” could be answered with “Yes, 
sir.”, but another possible answer is “Yes, sir, because the boy is probably going 
to die.”, depending on certain system parameters that modulate the generation 
of dialogue. 

(4) illustrates a dialogue between the implemented system and a user (acting 
as the system’s boss). 

(4)  User                              Agent 
     What happened here?               There was an accident, sir. 
     Who’s hurt?                       The boy and our driver. 
     How do you feel about the boy?    I’m feeling worried about him. 
     Why?                              The boy has critical injuries, sir. 

4 Conclusion and Future Research 



The key lesson from this work is that having a deep model of emotional 
reasoning can straightforwardly enhance the naturalness and expressivity of 
automated dialogue systems. With a natural language framework in place, we showed 
that an already existing emotion structure can be used for an emotion-referring 
dialogue with relative ease. The question-answer structure described above enables 
the user to ask an agent a variety of questions about its internal state, getting 
feedback in a natural way. Although we are still far from a complete natural 
implementation, this first step gives us valuable insights on how to proceed. New 
utterances can be added quite easily by adding new emotion states and natural 
language rules, resulting in the possible use of utterances like “Why are you mad 
at the mother?” The use of keyword scanning and templating delivered relatively 
quick results when using a small, known domain and enabled us to make easy 
changes as there’s no need for training natural language modules. 

The current implementation of the emotion dialogue has several 
limitations. Some could be eliminated by integrating it fully into the MRE system and 
making more use of the richness of the dialogue and emotion models. Others 
require further developments to the emotion and dialogue modules, including 
appraisals of obligations, justifications (see [3]), excuses, and questions about 
coping strategies. 

The following dialogue, suggested by an anonymous reviewer, provides a good 
starting point to illustrate these limitations: 



User: How do you feel about me? 
Agent: I’m feeling angry with you. 
User: Why? 
Agent: You didn’t come to the meeting. 
User: But it was not my fault! I had to finish some urgent work. Why do you 
get angry for such a stupid thing? 
Agent: Because I’m a serious and careful person: I hate waiting. 



The first two exchanges are fully supported (assuming the inclusion of an 
“attend meeting” action in the domain model that the user failed to execute). 
Implicit in this dialogue is that the user had an obligation to attend the meeting 
(probably acquired through dialogue). Although the current dialogue system 
maintains such obligations, they are not as yet appraised directly by EMA. See 
[3] for some developments along these lines. The user’s last utterance includes a 
rejection of blame (representable in EMA), an excuse (not currently represented; 
cf. [3]), and a question about the reason for an emotion (well beyond the range 
of our model). The agent’s reply includes a justification which is also beyond 
EMA’s current capabilities. 

A remaining issue concerns the naturalness of the 
resulting dialogue. For a variety of reasons, people often do not communicate their true 
emotional state unaltered. People may be in denial about their actual state, 
or may choose to express something other than their actual feelings to shift 
blame, manipulate social interactions or seek social support. EMA explicitly 
models such emotion-induced coping strategies, and thus provides a 
natural mechanism to incorporate such an “intentional stance” toward expressing 
emotion. For example, an agent in MRE may cope with anger by applying a 
coping strategy of denial. When asked about this emotion, a natural response 
might be, “I’m not angry!” 

Abstract. In the present paper we discuss a rule-based model for emotional 
dialogue simulation. The model was originally developed to describe the 
affective (emotional) processing of mass media texts and also applies to several 
other types of emotional communication, including conflict, complaint and 
speech aggression. In the proposed model we distinguish rules for “rational” 
inference (r-scripts) and rules for “emotional” processing of meaning 
(d-scripts). We consider that “affective” semantic components in text meaning 
are recognized by d-scripts and cause their activation, thus simulating speech 
influence. On the other hand, d-scripts define “affective” semantic shifts in texts 
that are produced in an emotional state or aimed at affecting the listener. 



1 Introduction 

Interest in the functional interpretation of emotional communication has recently been 
growing in several areas of linguistics and computer studies. Linguistic studies in this 
area are motivated not only by the desire to describe new material, but also by 
requests from applied areas: above all, the theory of advertising and the practice of juridical 
linguistic expertise. Our interest in studies of affective text processing was initiated by 
requests from state and public institutions to work out a procedure of juridical 
linguistic text expertise for court examination in honour protection cases. We 
expect that the same theoretical model could be used to construct artificial agents 
that experience speech influence and produce induced utterances. 

In our studies of natural text influence we have proposed a theoretical model with 
distinct procedures for “emotional” and “rational” processing of text. The 
theory was intended to provide a theoretical basis for the work of an expert during a text 
expertise in honour protection cases, information disputes or cases where it can be 
supposed that a certain text is aimed at affecting the audience of mass media. 

In the proposed model the speech influence of mass media texts is described in a 
unified way with some other types of emotional communication, in particular - with 
some types of conflict, complaint, and speech aggression (varying in distribution of 
actants). In all these cases the model simulates speech influence and generates texts 
with the help of d-scripts - units for emotional meaning processing, which are 
opposed to r-scripts - units for rational text processing. 

The model accepts text as input and constructs a semantic representation with the 
help of a Text-to-Meaning processor [6]. The model contains a number of rules - 
scripts - which can be activated by the constructed meaning. A script is activated by a 
meaning which corresponds (though not always exactly) to the starting model of this 
script. Rational inferences from the meaning of incoming text are simulated by 
rational scripts - r-scripts. The other group of scripts is intended to simulate 
emotional processing of incoming texts. Following [3] we consider that such units, 
responsible for simple, emotional reactions, dominate during information processing, 
and they are designated as dominant scripts or d-scripts. The model reacts to 
utterances like (1) The government is lying to you! and (2) The mayor thinks only 
about himself; this reaction is described as an activation of specific d-scripts: phrase 
(1) activates d-script DECEPT (“Deception”) and phrase (2) activates d-script SUBJV 
(“Subjectivity”). Following our studies of mass media texts, we have selected a list of 
13 d-scripts, quite sufficient to cover most of the negative propaganda in mass media.¹ 




Fig. 1. General structure of the model for description of affective text processing 

In addition to the behavioural output the model provides speech output from its 
internal components. In particular, the starting and target models of a d-script serve as 
meaning sources for the generation of emotional text. 

The proposed architecture with the distinction of “rational” and “emotional” 
processing of meanings corresponds to the FI-CogAff model discussed in [3, 4]. In 
particular, the mechanism of d-scripts is quite similar to alarm system in CogAff 
architecture and the concept of protospecialists in [2, p. 163-172]. D-scripts in the 
same way detect “critical” meanings in incoming texts (or in the results of inference, 
constructed during rational processing) and initiate affective reactions. 

¹ See: http://www.harpia.ru/d-scripts-en.html 

We will become angry if a person who is our neighbour or relative thinks only 
about himself or about some useless matter. The same applies to subjective 
actions of our boss or government. In this way we can be affected by utterances such 
as (3) Your father thinks only about his football or (4) The government is concerned 
only about its salary. The influence of such texts is simulated by the activation of 
one of the d-scripts: SUBJV (Subjectivity). The starting model of this script describes 
the situation of ‘subjectivity’ and includes slots AGGR - for the person or entity 
whose actions seem to be subjective, and VICT - for the person who is affected: 

SUBJV (AGGR, VICT [, M1, Paggr] [, M2]): 

AGGR doesn't consider relevant factors of the situation 
and is effecting or is going to effect [all the 
possible] actions Paggr [upon discovering of situation M1 
or to achieve a goal M2]; AGGR and VICT are linked with 
a relation Raggr-vict. 

The starting model of a d-script contains a list of critical elements - semantic markers 
which describe preferable semantic shifts in emotional texts. For example, one of the 
critical elements detects (and also produces) increased intensity of the main verb, 
which can mark implicit aggression of the speaker: Why do you speak? - neutral, vs. 
Why do you shout? - emotional [5]. A dialogue-oriented realisation of the model could 
accept a phrase as input, activate the d-script which is closest to its meaning 
(suppressing alternative r-scripts), shift the meaning following the list of critical 
elements defined for the d-script and provide the text with shifted meaning as output. 
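
The dialogue-oriented realisation outlined above can be pictured with a toy Python sketch. It is only a stand-in: the model matches semantic representations against starting models, whereas the sketch matches surface keywords, and the trigger words and the verb lexicon are invented for illustration.

```python
import re

# Hypothetical trigger words standing in for the starting models of two d-scripts.
D_SCRIPT_TRIGGERS = {"SUBJV": {"only", "himself"}, "DECEPT": {"lying", "lie"}}
# Hypothetical lexicon for one critical element: increased intensity of the verb.
INTENSIFIED_VERBS = {"speak": "shout", "speaks": "shouts"}


def activate_d_script(phrase):
    """Return the d-script sharing the most trigger words with the phrase, or None."""
    words = set(phrase.lower().strip(".!?").split())
    scores = {name: len(trig & words) for name, trig in D_SCRIPT_TRIGGERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def shift_meaning(phrase):
    """Apply the critical element: replace each listed verb by its intense variant."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: INTENSIFIED_VERBS.get(m.group(0), m.group(0)),
        phrase,
    )


print(activate_d_script("The government is lying to you!"))
print(shift_meaning("Why do you speak?"))
```
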



[Figure 2 (diagram): the INPUT “The government is working on the budget.” is matched 
against a d-script starting model with critical elements, yielding the OUTPUT “They all 
always shout only about their budget / such trifles!”; a possible r-script inference is 
“the budget may be approved in about one month”.] 



Fig. 2. Activation of SUBJV d-script by a “neutral” text and feedback, experiencing several 
semantic shifts, as defined by the critical elements of SUBJV 

We consider that examples with explication of critical elements are better 
recognised by d-scripts during text perception and can be easily constructed in an 
emotional state (following the activation of the respective d-script). 

The same mechanism is used in several types of emotional communication, 
varying in the distribution of AGGR and VICT slots between the communicants. On 
one hand the proposed model activates d-scripts when receiving texts like (5) Your 
mother doesn’t tell you the truth (it will consider itself as a victim - VICT, while the 3rd 
person ‘mother’ - as an aggressor, AGGR); on the other hand, it can produce texts like 
(6) The government is always lying to you! (when trying to affect the opponent and 
make him believe, he is a victim of some other aggressor) or (7) You are always lying 
to me! (in conflict communication, where the addressee is believed to be the 
aggressor). Different types of AGGR/VICT distribution are represented in Table 1. 

Table 1. Expectations of addresser on the distribution of AGGR and VICT slots between the 
participants of communication 

No.  Addresser  Addressee  3rd party  Thesis of addresser 
1.   VICT       AGGR       -          Conflict. You are concerned only about your football! 
                                      You, politicians, are only concerned how to speak well! 
2.   VICT       -          AGGR       Complaint. My husband speaks only about his football! 
3.   VICT       VICT       AGGR       “Communication of victims”. Modern youth think only about dances! 
4.              VICT       AGGR       Influence. The Government thinks only about their taxes! 
5.   AGGR       VICT       -          Aggression. I’m going to kill you! (for d-script DANGER) 



In the present paper we have briefly presented the general architecture of a model 
for the processing of affective texts. We have established that affective processing is 
performed by specific components of the model - d-scripts. D-scripts apply to 
different types of emotional communication and define meaning shifts and the selection 
of semantic components to appear in the text and, in some cases, to start a listener’s 
d-script. We expect that the proposed model can be implemented in computer agents 
simulating speech influence (activating d-scripts while processing “emotional” texts) 
or supporting “communication of victims” and conflict communication in 
predefined subject domains. 

Abstract. This paper describes an initial approach to emotional speech 
synthesis in Catalan based on a diphone concatenation TTS system. 
The main goal of this work is to develop a simple prosodic model for 
expressive synthesis. This model is obtained from an emotional speech 
collection artificially generated by means of a copy-prosody experiment. 
After validating the emotional content of this collection, the model was 
automated and incorporated into our TTS system. Finally, the automatic 
speech synthesis system has been evaluated by means of a perceptual 
test, obtaining encouraging results. 



1 Introduction 

Nowadays, human-computer interaction (HCI) systems tend to incorporate both 
vision and speech because they are the natural channels of human 
communication. For this reason, HCI should be bidirectional [1], since i) the machine could 
understand the user’s message using both speech recognition and computer 
vision techniques [2] and ii) the machine could answer by means of audiovisual 
synthesis [3]. Moreover, the interaction would become more efficient and 
user-friendly if emotional expressions could be recognized [4] [5] and synthesized [6]. 

The present work is our first approach to automatic emotional speech 
synthesis in Catalan with the purpose of including emotional expressivity in the output 
channel of an HCI system [7] [8]. Catalan is the native language of Catalonia, 
the Valencian Country and the Balearic Islands (central east and north east part 
of Spain), and is spoken by more than 6 million people. Nevertheless, these 
are bilingual areas where Spanish is also spoken, being, in fact, the dominant 
language used for communication (news, TV, radio, ...). Thus, Catalan is a 
minority language relative to Spanish. Like other languages, Catalan 
has several varieties (Central, North-occidental, Valencian, Balearic and others) 
that are spoken in different areas of the territory. We have focused our studies 
on the Central variety, which is spoken near the city of Barcelona. 

This paper is organized as follows: Section 2 presents a brief description of 
the related literature and our previous work in this field. Section 3 summarizes 
the method chosen to model and synthesize emotional speech, which is fully 
described in sections 4 and 5. Section 6 presents a discussion about the different 
topics related to the approach introduced. Finally, the last section presents the 
conclusion of this work. 

2 Background 

Emotional speech synthesis is a complex process that involves the integration of 
different knowledge areas. On the one hand, Psychology tries to describe 
emotions and the corresponding human actions to express and perceive them [9]. 
According to the review presented by Bartneck [10], emotions can be described 
as discrete categories or as regions of a multidimensional space. On the other 
hand, Psychoacoustics analyzes the effect of emotions on speech, showing that 
the voice undergoes acoustic changes due to physiological alterations [11]. Therefore, 
these variations have to be considered in order to obtain an acoustic model of 
emotional speech. Usually, the modeling process involves choosing the most 
relevant parameters and a representation of their behavior. In this sense, an emotional 
speech corpus becomes essential to define the corresponding acoustic model [12]. 
As a final point, Speech Technology research collects this knowledge in order 
to synthesize emotional speech, incorporating the defined models and corpora. 
In [6], a full review of different approaches to emotional speech synthesis is 
presented. These approaches can be classified into three main groups: i) rule-based 
synthesis, such as HAMLET [13] and the Affect Editor [14], ii) diphone 
concatenation synthesis [15] [16], and iii) corpus-based synthesis [17]. To date, the 
most complete study of emotional speech synthesis in Castilian Spanish has been 
described by Montero [18] [19]. 

2.1 Our Previous Work 

This work is based on the previous investigations of [20], who built an acoustic 
model for basic emotions in Castilian Spanish following these steps: 

— Generation of a database of emotional speech. The speech corpus was 
recorded by 8 actors (4 males and 4 females), performing a set of carrier 
sentences with the seven basic emotions (fear, happiness, anger, sadness, surprise, 
desire and disgust). Each text was pronounced with 3 levels of emotional 
intensity. Therefore, 336 different discourses were collected (2 texts x 8 actors 
x 7 emotions x 3 intensities). 

— Perceptual test. A perceptual test was carried out to choose the most 
representative recordings for each emotion. Each emotional locution (with 
duration ranging from 20 to 40 seconds) was listened to by two groups of 30 
people. The five best ranked utterances of each emotion were selected for 
the final database, according to the highest percentage of both identification 
and level of credibility. The identification score of all the selected 
utterances exceeded 90%, except for disgust, where all utterances scored less than 
50%. For this reason, disgust was not acoustically modeled and therefore the 
model was only defined for six emotions. 

— Acoustic analysis. A systematic analysis of the selected utterances was 
carried out in terms of fundamental frequency (mean, range and variability), 
sound pressure (mean, range and variability) and timing parameters 
(duration and number of silences, duration and number of phonic groups and 
syllables per second). 
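
The fundamental-frequency measures named in the acoustic analysis step can be illustrated with a short Python sketch. It assumes that the pitch contour is given as a list of values in Hz with 0 marking unvoiced frames, and that "variability" means the standard deviation; neither assumption is stated in [20], and the contour values are invented.

```python
from statistics import mean, pstdev


def f0_statistics(f0_hz):
    """Mean, range and variability of a pitch contour, skipping unvoiced frames."""
    voiced = [f for f in f0_hz if f > 0]  # 0 Hz is taken to mean "unvoiced"
    return {
        "mean": mean(voiced),
        "range": max(voiced) - min(voiced),
        "variability": pstdev(voiced),  # standard deviation, by assumption
    }


contour = [0.0, 110.0, 120.0, 130.0, 0.0]  # invented example contour
stats = f0_statistics(contour)
print(stats)
```
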

Later, this acoustic model was validated using concatenative speech synthesis 
[21], concluding that the model was valid to simulate sadness, anger and fear 
only by means of the prosodic modification. However, the results obtained for 
happiness, surprise and desire showed that this methodology was inadequate to 
achieve a sufficient level of identification of these emotions. 

3 Our Approach 

The main objective of this work is to incorporate emotional speech into our 
Catalan TTS system, with the purpose of achieving better expressiveness. 

Our approach starts from the hypothesis that the modification of the 
prosodic parameters of the TTS system is the first step towards emotional speech 
synthesis. To date, we only consider melody, rhythm and energy modeling by 
means of the modification of pitch, energy and duration of the phones and the 
duration of pauses. Initially, only four basic emotions have been modeled (fear, 
happiness, anger and sadness) with the goal of exploring the possibilities of this 
method. As a result, a prosodic model for each emotion will be obtained and 
implemented. 

One significant aspect of this approach is the use of a previously developed 
resource, a Spanish database, instead of developing a new Catalan database. We 
believe that a Spanish database suffices because the two languages are very 
similar and almost all Catalan people speak both languages perfectly. At the 
beginning of this work we had at our disposal a Spanish database validated by 
means of a rigorous perceptual test [20]. The recording and validation of a new 
database in Catalan would have been an expensive task that we could not carry 
out. Therefore, the data generation in Catalan was performed by means of a 
copy-synthesis experiment with translated texts of the Spanish database. 



3.1 Summary of the Methodology 

Below, we present a summary of the steps followed during this work (see Figure 
1), which are detailed in sections 4 and 5: 

1. The emotional speech corpus in Spanish presented in [20] was chosen as 
the source data for the acoustic modeling. From this corpus, four different 
utterances containing the same text (one per emotion) were selected, 
representing the corresponding emotion clearly. Each utterance contained 11 
sentences, which were segmented into phones and labeled (pitch, energy and 
duration). The duration of the pauses between sentences was also annotated. 

2. Then, these sentences were translated into Catalan and their phonetic 
transcriptions were generated. Moreover, the prosody associated with each phone 
was manually adjusted from the corresponding phone of the Spanish database. 

3. Next, a small data collection of emotional speech in Catalan was generated 
after TTS synthesis. 

4. A perceptual test of emotion identification was developed in order to validate 
this synthetic speech collection. 

5. A prosodic model adapted to our TTS system was obtained by comparing 
the annotated emotional speech with the default output (neutral) when 
synthesizing the same text. 

6. The emotional prosodic model was incorporated into the Natural Language 
Processing module of our TTS system. 

7. And finally, a new perceptual test was performed to evaluate the degree of 
emotion identification when this model was automated. 




Fig. 1. Flowchart that summarizes the seven steps followed during the definition and 
the validation of the prosodic model for emotional speech synthesis in Catalan 
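
Step 5 of the methodology, the comparison of emotional speech with the neutral output for the same text, can be pictured with the following Python sketch. It assumes that the model is a set of average emotional-to-neutral ratios over aligned phones; the actual form of the model is defined later in the paper, and all values below are invented and given in arbitrary units.

```python
from statistics import mean

PARAMS = ("pitch", "energy", "duration")


def prosodic_model(emotional, neutral):
    """Average emotional-to-neutral ratio of each parameter over aligned phones."""
    return {
        p: mean(e[p] / n[p] for e, n in zip(emotional, neutral))
        for p in PARAMS
    }


def apply_model(neutral, model):
    """Scale the neutral prosody of each phone by the model's ratios."""
    return [{p: phone[p] * model[p] for p in PARAMS} for phone in neutral]


# Invented values for two phones of the same text.
neutral = [
    {"pitch": 100.0, "energy": 60.0, "duration": 80.0},
    {"pitch": 120.0, "energy": 60.0, "duration": 100.0},
]
fearful = [
    {"pitch": 150.0, "energy": 66.0, "duration": 60.0},
    {"pitch": 180.0, "energy": 66.0, "duration": 75.0},
]

model = prosodic_model(fearful, neutral)
print(model)
```

Applying the resulting ratios to the neutral prosody of a new text is then one simple way to automate the model, as apply_model does here.
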



4 Prosodic Modeling Oriented to Expressive Catalan TTS 

This section describes the process followed to obtain a prosodic model for emo- 
tional speech synthesis adapted to our TTS system in Catalan. This process 
follows three steps: firstly, the generation of the emotional speech data collec- 
tion, secondly, its validation, and, finally, the definition of the prosodic model. 
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4.1 Stimuli Generation 

The emotional speech data collection in Catalan is generated using our TTS 
system with an input of phonetic and prosodic information manually adjusted. 
The phonetic transcription was obtained by translating the original text into 
Catalan. The prosodic information was adjusted following the same patterns of 
intonation, rhythm and intensity of the utterances selected from the Spanish 
corpus. Note that we assume the Spanish prosodic pattern can be valid for 
Catalan due to the similarities between the phonetics and the prosody of both 
languages. 

As a result, we had at our disposal a synthetic male speech collection
corresponding to four different emotions in Catalan for the same text (see
Appendix A). In addition, the speech collection was completed with a neutral
utterance, obtained by synthesizing the same text with our TTS system. This
neutral utterance is used as the reference pattern for the prosodic modeling of
the considered emotions.

4.2 Validating the Emotional Speech Data 

A perceptual test involving 10 listeners was conducted to validate the
artificially generated emotional speech. Before the test began, neutral synthetic
speech was played to each listener so that they could become familiar with the
synthetic speech of our TTS system. Next, each subject listened to four
utterances, randomly selected, in two consecutive rounds. After listening to
each utterance only once, they had to choose between fear, happiness, anger,
sadness or uncertain emotion (forced-choice test). As depicted in Figure 2, the
second round presents a better degree of identification than the first one. The
reason for this result is that the subjects had already listened to all stimuli
once, so they had a criterion for comparing the four emotions.



[Figure 2: two panels, “Synthetic Emotional Speech (Round 1)” and “Synthetic Emotional Speech (Round 2)”. Axis label: identification/confusion. Category labels: Fear, Happiness, Anger, Sadness. Legend: Fear, Happiness, Anger, Sadness, Uncertain.]



Fig. 2. Identification/confusion percentages between emotions on the synthetic speech 
obtained with the prosody manually adjusted 



Figure 2 also shows that sadness is the best identified emotion in both
rounds, reaching 100% identification in the second one. Anger and fear
improve their percentages after the first round, since by then the listeners
can establish differences between emotions. Happiness, in contrast, is the only
emotion that presents worse results in the second round, decreasing from 40% to
30% identification. This experiment corroborates the general finding that
happiness is a difficult emotion to simulate [16], [18], [21]. Moreover, some
listeners reported that it was harder to identify happiness in an utterance
with no positive intention.



4.3 Generating the Prosodic Model 

The main goal of this modeling is to obtain a general behavior of the prosodic 
parameters of speech related to the four considered emotions. However, the emo- 
tional prosodic model developed in this work is simple because of the constraint 
of being incorporated in a concatenative speech synthesizer. In this approach, 
the prosodic parameters calculated from the male speech collection are grouped 
into three categories: pitch, timing and energy, which are described as follows: 



Table 1. Relative percentage of mean variation of pitch parameters with respect to
neutral style for each emotion

Relative mean variation    Fear    Happiness    Anger    Sadness

Average pitch              +52%    +13%         +33%     -7%

Pitch range                -3%     -10%         +30%     -60%



Pitch. The pitch parameters describe several features of the fundamental
frequency (F0). The average pitch and the pitch range of each utterance are
calculated. Table 1 summarizes the percentage of the average variation of both
parameters with respect to the corresponding values of the neutral style. Notice
that the average pitch presents high increments in fear and anger, and a slight
reduction in sadness. On the other hand, the pitch range presents higher
variations in anger (positive) and sadness (negative). However, happiness presents
non-representative variations of both parameters in terms of their mean values.
Moreover, Figure 3 presents the statistical representation of the results obtained
for both parameters in the emotional speech data collection.
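The relative mean variation reported in the tables can be illustrated with a short sketch. It assumes that the variation is the percentage change of the emotional mean with respect to the neutral mean; the function name and the F0 values in the usage note are hypothetical and are not taken from the corpus.

```python
from statistics import mean


def relative_mean_variation(emotional_values, neutral_values):
    """Percentage change of the emotional mean relative to the neutral mean.

    Assumption: "relative mean variation" means
    100 * (mean_emotional - mean_neutral) / mean_neutral.
    """
    neutral_mean = mean(neutral_values)
    if neutral_mean == 0:
        raise ValueError("the neutral mean must be non-zero")
    return 100.0 * (mean(emotional_values) - neutral_mean) / neutral_mean
```

For instance, with hypothetical average-pitch values of 150 Hz and 154 Hz for an emotional utterance and 98 Hz and 102 Hz for the neutral one, the function yields a variation of +52%.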



Timing. The timing parameters describe the features related to speech rate.
In this paper, the duration of pauses and the average duration of phonic groups
are calculated for each utterance. The duration of pauses parameter represents
the increment or the decrement of the mean duration of pauses for each emotion
with respect to the neutral speech (see Table 2).

The duration of phonic groups is calculated as the relative percentage of the
mean phone durations of an emotional utterance with respect to its mean value
in the neutral utterance (see Table 3). Note that the last phonic groups of the
sentences are treated separately because they present a different behavior. Table
3 also shows that the speech rate is accelerated for anger and it is clearly slowed
for sadness.

Fig. 3. (a) Mean and standard deviation (μ ± σ) of average pitch, (b) Mean and
standard deviation (μ ± σ) of pitch range.

Table 2. Relative percentage of mean variation of pause duration with respect to the
neutral style

Relative mean variation    Fear    Happiness    Anger    Sadness

Duration of pauses         +38%    +3%          -15%     +128%



Table 3. Relative percentage of mean variation of phone duration with respect to 
neutral style 



Relative mean variation Fear Happiness Anger Sadness 

Duration of last phonic groups +9% +0.2% -7% +25% 

Duration of the other phonic groups +6% +12% -4% +23% 



Energy. The energy parameters describe features of the amplitude of the speech
signal. We have calculated the average energy and the energy range of each
phrase of the corpus. Table 4 summarizes the calculated values as increments or
decrements of the average energy and of the energy range with respect to the
neutral style. All these values are expressed in decibels (dB).

Table 4. Relative variation of energy parameters with respect to the neutral style in
dB

Relative variation    Fear     Happiness    Anger    Sadness

Average energy        -0.16    +0.29        +1.13    -1.46

Energy range          +13      +11.1        +14.3    +10.4

5 Automation of the Prosodic Model

The automation of the prosodic model involves the definition of a set of rules
describing the results presented in Section 4. These rules are defined as a
modification of the prosodic parameters generated automatically by the TTS system
for the neutral style. As pitch and energy parameters have two degrees of
freedom (average and range), the adjustment of their corresponding values for each
phone of a sentence follows these steps:

1. From the text, the prosodic parameter values for each phone are calculated
(neutral style), $p_0$ in equations (1) and (2).

2. Normalize $p_0$ by subtracting its mean value $\bar{p}_0$.

3. Adjust the normalized values to the desired range following equation (1),
where $\Delta R$ is the mean range correction.

4. The final parameter values, $p_f$, are obtained by adding the desired average
parameter to the values obtained after step 3. In equation (2), $\Delta A$ is the
mean average correction.

$$p = \Delta R \cdot (p_0 - \bar{p}_0) \qquad (1)$$

$$p_f = p + \Delta A \cdot \bar{p}_0 \qquad (2)$$

On the other hand, the duration parameter adjustment consists of multiply- 
ing the values generated by the TTS system by the mean duration correction. 
The new speech rate is obtained after applying the corresponding duration cor- 
rections to pauses and phones. As described in Section 4.3, the model defines a 
particular modification for the final phonic group. 
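The adjustment steps and the duration scaling described above can be sketched as follows. This is a minimal illustration and not the implementation used in the TTS system. It assumes that the range and average corrections are multiplicative factors (for example 1.30 for a +30% variation), which the text does not state explicitly; the function names and the numeric values in the usage note are hypothetical.

```python
def adjust_parameter(neutral_values, range_correction, average_correction):
    """Apply equations (1) and (2) to the per-phone values of one sentence.

    neutral_values     -- p0, values generated by the TTS for the neutral style
    range_correction   -- mean range correction (assumed multiplicative)
    average_correction -- mean average correction (assumed multiplicative)
    """
    if not neutral_values:
        raise ValueError("at least one phone value is required")
    neutral_mean = sum(neutral_values) / len(neutral_values)
    # Equation (1): centre the values on zero and rescale their range.
    rescaled = [range_correction * (v - neutral_mean) for v in neutral_values]
    # Equation (2): add the corrected average back.
    return [v + average_correction * neutral_mean for v in rescaled]


def scale_durations(durations, duration_correction):
    """Multiply each TTS-generated duration by the mean duration correction."""
    return [d * duration_correction for d in durations]
```

With hypothetical neutral pitch values of 100, 120 and 140 Hz, a range correction of 1.3 and an average correction of 1.33, the mean of the output is 1.33 times the neutral mean and its range is 1.3 times the neutral range. The final phonic group would be passed to `scale_durations` with its own correction, as the model treats it separately.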

5.1 Evaluating the Automatic Results 

A perceptual test was conducted in order to evaluate the automatically
generated emotional speech. Ten non-expert subjects (students of Engineering)
listened to four utterances, each with a different emotion and synthesized from
the same text, in two consecutive rounds. The listeners had to choose between
fear, happiness, anger, sadness or uncertain (forced-choice test).

Figure 4 shows the percentages of the conducted emotion
identification/confusion test. Sadness is the emotion with the highest percentage
of identification, followed by fear, which presents an acceptable result. Anger
is confused with happiness in 30% of the cases. Happiness obtains a result only
slightly higher than the baseline.

6 Discussion 

After analyzing the obtained results, we wish to discuss some important topics
related to the followed methodology and the resources employed in this work.

[Figure 4: two panels, “Synthetic Emotional Speech (Round 1)” and “Synthetic Emotional Speech (Round 2)”.]

Fig. 4. Perceptual test of identification/confusion of emotion from our TTS system 
when the automated prosodic model is incorporated 



6.1 Emotion Description for Speech Synthesis Applications 

In our approach, the emotions are treated as distinct categories, and in this
particular case, we have taken into account only four basic emotion categories:
anger, happiness, fear and sadness. According to [22], an alternative
approach starts with a different representation of emotional states, as emotion
dimensions rather than discrete categories. Emotion dimensions are a
simplified representation of the essential properties of emotions. Usually,
Evaluation (positive/negative) and Activation (active/passive) are the most
important dimensions. The synthesis system resulting from this dimensional
approach is by design highly flexible and very efficient when complementary
sources of information, such as verbal content, the visual channel, and the
situational context, are present in an HCI application. In this sense, the
authors will pursue further research in order to incorporate different emotional
states or dimensions.

6.2 Emotional Speech Corpus 

The development of a new database of emotional speech in any language is a 
difficult and expensive task. In [12], the main issues that need to be considered 
when developing this kind of corpus are presented. If the database is oriented 
to emotional speech synthesis, it has to be designed to involve both the acou- 
stic modeling of emotions (off-line process) and the synthesis process (on-line). 
Currently, there is a tendency to use unit selection synthesizers in order to mi- 
nimize the prosodic modification of the acoustic units at synthesis time [17]. 
Consequently, the resulting speech is natural-sounding for the specific categories 
which were recorded without a previous modeling of the acoustic properties of 
emotions. The most important limitation of this kind of approach is the enormous 
database (one corpus per emotion) involved in the speech synthesis process. 

6.3 Modeling Speech and Emotion 

One of the most critical aspects of the acoustic modeling of emotional
expressions is the temporal behavior of the voice parameters, which change
depending on the emotional state of the talker. These changes are only present
at certain moments of the speech message [20]. Thus, both the frequency of
occurrence of these changes and their position in time within the discourse are
key issues to be explored in future investigations. For instance, in this work,
we have not obtained good results modeling and synthesizing happiness because
our model is based only on average values of the analyzed parameters. This
approach therefore seems inadequate when the considered parameters present a
high deviation with respect to their mean value. We believe that proper control
of the parameter changes over time would improve the acoustic modeling of
this kind of emotion.

Another important issue to be taken into account is the development of ana- 
lysis and synthesis tools to process the whole set of relevant speech parameters 
involved in emotional expressions [14]. Currently, we are working on a parame- 
terized concatenative speech synthesis system to allow a higher degree of control 
of prosodic and acoustic modification of the speech units [23] . This system is a 
hybrid method based on TD-PSOLA and the harmonic plus noise model, which 
incorporates a novel method to jointly modify pitch and time-scale. Moreover, 
it is able to control separately the energy of both the harmonic and the noise 
components of speech. 

Moreover, the prosodic modeling of emotions will have to be improved to
achieve a higher level of credibility for some emotions. On the one hand, the
method for modeling duration should consider the differential elasticity of the
different phoneme classes instead of stretching all synthesized phonemes
uniformly [24]. On the other hand, there are problems associated with energy
modeling that we have yet to solve, for instance the change in voice quality
parameters related to vocal effort.

7 Conclusion 

This work is our first approach to emotional speech synthesis in Catalan by
means of modeling the average variation of prosodic parameters with respect to
the neutral prosody generated by our TTS system. An emotional speech collection
in Catalan has been generated artificially and validated via a perceptual
test. The analysis of this speech collection has resulted in the definition
of a model that converts neutral prosody into emotional prosody automatically.
A second perceptual test has been performed in order to evaluate the
identification/confusion percentages of the automatic system, obtaining
encouraging results.

Appendix

Initial text in Spanish:

“La casa apareció al final del camino. Empezaba a ponerse el sol, pero la
fachada del edificio aún se veía con claridad. Unas figuras pasaban por detrás de
las ventanas del piso superior. Me acerqué poco a poco. Nadie me vio, nadie me
esperaba, nadie me recibió, entré sin hacer ruido. Subí las escaleras con agilidad.
Las voces me guiaron hasta la gran habitación y lo vi todo.”

Translated text into Catalan:

“La casa aparegué al final del camí. Començava la posta de sol però, la façana
de l’edifici encara es veia amb claredat. Unes figures passaven per darrera de les
finestres del pis superior. Em vaig apropar a poc a poc, ningú em veié, ningú
m’esperava, ningú em rebé. Vaig entrar sense fer soroll. Vaig pujar les escales
amb agilitat. Les veus em guiaren fins a la gran habitació i ho vaig veure tot.”

Abstract. Past attempts to model emotions for speech synthesis have
focused on extreme, “basic” emotion categories. The present paper
suggests an alternative representation of emotional states, by means of
emotion dimensions, and explains how this approach can contribute to
making speech synthesis a useful component of affective dialogue systems.



1 Introduction: The Problem 

In most studies exploring the expression of emotions in speech synthesis [1,
2], between three and nine discrete, extreme emotional states are
modelled. However, the often implicit assumption that the expression of a few
basic or primary emotion categories is most important to model, and that other
emotional states can somehow be derived from that, has been questioned by
Cowie [3]. He argued that systems should be able to express less intense emotions
more suitable for real life applications.

There are a number of alternative choices available for representing emotions. 
While well established in the psychological literature, these are not very well 
known in the intelligent user interfaces community. Instead, representations are 
used that may be suboptimal for a given task, as illustrated by the frequent use 
of the Darwinian concept of “basic emotions” in emotional speech synthesis [2]. 

Instead of proposing a single “best” emotion representation, I argue that 
different components of an affective dialogue system may best be using different 
emotion representations optimised for their respective tasks. In the following, 
a selection of such representations is presented. Their suitability for different 
aspects of affective dialogue systems is discussed, and the question of mapping 
between these representations is raised. 

The use of emotion dimensions for speech synthesis is then demonstrated in 
more detail. 

2 Types of Emotion Representation 

2.1 Emotion Categories 

The most straightforward description of emotions is the use of emotion-denoting
words, or category labels. Human languages have proven to be extremely
powerful in producing labels for emotional states: Lists of emotion-denoting
adjectives exist that include at least 107 English [4] and 235 German [5] items.
Several approaches exist for reducing these to an essential core set.

Basic emotions. Especially in the Darwinian tradition of emotion research [6],
there is general agreement that some full-blown emotions are more basic than
others. From this point of view, the basic emotions correspond to specific,
evolutionarily shaped functions benefiting survival. They constitute highly
specific, intense, multi-faceted syndromes consisting of physiological changes,
action tendencies, expressive behaviour etc. In databases of spontaneous
emotional speech, these basic emotions are rarely found [7].

Superordinate emotion categories. Alternatively, emotion categories have been 
proposed as more fundamental than others on the grounds that they include the 
others. An example may clarify the idea: [8] proposed five prototypes underlying 
all emotion categories: Anger, love, joy, fear, and sadness. Joy, for example, would 
be subdivided into pride, contentment, and zest. [9] gives a short overview of 
recent proposals of such lists. 

Essential everyday emotion terms. A pragmatic approach is to ask for the emo- 
tion terms that play an important role in everyday life. The approach is exem- 
plified by the work of [10], who proposed a Basic English Emotion Vocabulary. 
Starting from lists of emotion terms from the literature, subjects were asked to 
select a subset which appropriately represents the emotions relevant in everyday 
life. A subset of 16 emotion terms emerged. 

This short list of options shows that even if one decides to model emotions 
in terms of categories, it is not immediately clear what categories to use. The 
most frequently used categories may not be the most suitable ones for a given 
research question or application. 

2.2 Appraisal-Based Descriptions 

In cognitive emotion theories, the central concept is appraisal - an evaluation 
of a stimulus through relatively low-level, automatic cognitive processes. The 
appraisal of a stimulus determines the significance of the stimulus for the indi- 
vidual, and triggers an emotion as an appropriate response. 

Details about how and according to which criteria the perceived stimuli are
evaluated and which reactions are triggered have been worked out by a number
of researchers. The most notable for a speech and emotion researcher is the
component process model developed by Scherer [11], from which Scherer has
made detailed physiological predictions about the vocal changes associated with
certain emotions [12], which in their large majority were verified experimentally
[13].

Another cognitive emotion model, detailing the presumed appraisal structure
leading to the multitude of emotions, was proposed by Ortony, Clore and Collins
[14]. In this so-called OCC model, emotions are seen as valenced reactions to
three types of stimuli: Events, agents, and objects. The model is formulated in a
way permitting its implementation in AI systems. Several conversational agent
systems have adopted the model, in a so-called “affective reasoning” module [15].




Dimensional Emotion Representation as a Basis for Speech Synthesis 






2.3 Emotion Dimensions 

Many different approaches reported in the psychological literature have led to the 
proposal of dimensions underlying emotional concepts (see [1] for an overview). 
Through multidimensional scaling, semantic differential and other techniques, 
different researchers came to propose three dimensions. They are gradual in 
nature and represent the essential aspects of emotion concepts (how good or bad, 
how aroused or relaxed, how powerful or weak) rather than the fine specifications 
of individual emotion categories. 

It is important to know that the names used for these dimensions were
actually selected by the individual researchers interpreting their data, and did
not arise from the data itself. That explains the relative arbitrariness in naming
the dimensions that can be found throughout the literature (Mehrabian and Russell
call them pleasure, arousal and dominance [16], Osgood et al. use the names
evaluation, activity and potency [17], Cowie et al. use evaluation, activation and
power [18]). In this paper, I will use the terms evaluation (synonymous to
valence or pleasure), activation (used as synonymous to arousal and activity) and
power (potency or dominance).



2.4 Suitability of a Representation for a Task 

It is important to think carefully about the type of representation most suitable 
for a given task. 

The planning component of a dialogue system needs to assess the situation 
and interpret the meaning of items in order to assess how to act and respond. 
This is the natural domain for appraisal-based descriptions, which can map 
situation appraisals to emotions. 

In a dialogue, an emotional state may build up rather gradually, and may
change over time as the interaction moves on. Consequently, a speech synthesis
system should be able to gradually modify the voice in a series of steps
towards an emotional state. In addition, it seems reasonable to assume that most
human-machine dialogues will require the machine to express only mild,
non-extreme emotional states. Therefore, the need to express full-blown emotions
is a marginal rather than a central requirement, while the main focus should be
on the system’s capability to express a large variety of emotional states of low
to medium intensity. Emotion dimensions are a representation of emotional states
which fulfils these requirements: They are naturally gradual, and are capable
of representing low-intensity as well as high-intensity states. While they do not
define the exact properties of an emotional state in the same amount of detail
as a category label, they do capture the essential aspects of the emotional state.

Facial expressions, on the other hand, appear to be best generated using 
categorical descriptions of emotion. However, there is a need not only to vary 
the expressions in intensity, but also to blend them. Emotion dimensions can 
help in determining the similarity of emotion categories and therefore find the 
right “mixture” of facial attributes for intermediate states [19]. 

2.5 Mappings between Emotion Representations

It is an open research question how best to map between the different emo- 
tion representations of which only some have been listed above. As the different 
representations have been created for different purposes and capture different 
subsets of emotion-related phenomena, simple one-to-one mappings are not al- 
ways possible. 

Emotion categories can be located in emotion dimension space via rating 
tests [10]. The mapping from categories to dimensions is therefore a simple 
task, as long as the coordinates of the emotion category have been determi- 
ned. The inverse, however, is not possible: As emotion dimensions only capture 
the most essential aspects of an emotion concept, they provide an underspecified 
description of an emotional state. For example, the coordinates for anger and 
disgust may be very close, because the two categories share the same activa- 
tion/evaluation/power properties. The features distinguishing between the two 
categories cannot be represented using emotion dimensions, so that the corre- 
sponding region in space can only be mapped to “anger-or-disgust” rather than 
a specific category. 
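The asymmetry between the two mapping directions can be illustrated with a small sketch. The coordinates below are hypothetical values chosen only for illustration (they are not rating-test results), and the distance threshold is likewise an assumption: the forward mapping is a table lookup, whereas the inverse mapping can only return the set of categories near a point.

```python
import math

# Hypothetical (activation, evaluation, power) coordinates on a -100..100 scale.
CATEGORY_COORDS = {
    "anger": (60.0, -60.0, 40.0),
    "disgust": (55.0, -65.0, 35.0),
    "sadness": (-50.0, -60.0, -40.0),
}


def to_dimensions(category):
    """Category -> dimensions: a simple lookup once coordinates are known."""
    return CATEGORY_COORDS[category]


def candidate_categories(point, radius=15.0):
    """Dimensions -> categories: every category within `radius` of the point.

    The result may contain several names, reflecting the underspecified
    nature of the dimensional description.
    """
    return sorted(
        name
        for name, coords in CATEGORY_COORDS.items()
        if math.dist(point, coords) <= radius
    )
```

With these illustrative coordinates, a point such as (58, -62, 38) is returned as both anger and disgust, which corresponds to the “anger-or-disgust” region described above.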

In appraisal-based descriptions, the link to and from categories is also pos- 
sible. A given combination of appraisal outcomes corresponds to an emotion 
category. 

The mapping from an appraisal-based representation to a dimensional repre- 
sentation becomes possible via the intermediate representation as emotion ca- 
tegories. This approach is not without pitfalls, however: Emotion categories are 
often represented by simple words, such as “anger”, which are highly ambiguous. 
If the interpretation of the word “anger” in the context of its constituting ap- 
praisals differs from its interpretation when locating it on emotion dimensions, 
then the mapping becomes inaccurate. Disambiguating the emotion category, 
on the other hand, implies a fuller description of the emotion, which ultimately 
would require an all-encompassing emotion representation. 

It should be kept in mind, therefore, that mappings between the currently 
existing emotion representations are necessarily imperfect. 

A first step in this direction was nevertheless taken in the NECA
project [20]. An affective reasoning component working with the OCC model
is used for determining the appropriate emotion in a given dialogue situation, 
represented as a combination of emotion category and intensity. This represen- 
tation is mapped onto emotion dimensions, using the intensity value to linearly 
interpolate between the neutral state and the coordinates of the fully developed 
emotional state in the dimensional space. The speech synthesis component uses 
the dimensional representation to generate emotional speech as described below. 
Finally, the animation component uses the category and intensity representation 
for generating facial expressions. 
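The linear interpolation between the neutral state and a fully developed emotion can be sketched as follows. The sketch assumes that intensity is expressed on a scale from 0 (neutral) to 1 (fully developed) and that the neutral state lies at the origin of the dimensional space; the function name and the example coordinates are hypothetical.

```python
NEUTRAL = (0.0, 0.0, 0.0)  # assumed neutral point (activation, evaluation, power)


def category_to_dimensions(full_coords, intensity, neutral=NEUTRAL):
    """Linearly interpolate between `neutral` and `full_coords`.

    intensity -- assumed to lie in [0, 1]; 0 gives the neutral state and
                 1 gives the coordinates of the fully developed emotion.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return tuple(n + intensity * (c - n) for n, c in zip(neutral, full_coords))
```

For example, an emotion whose fully developed state is placed at the hypothetical coordinates (60, -60, 40) and which occurs with intensity 0.5 would be mapped to (30, -30, 20).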



3 Application to Speech Synthesis 

Emotions can be expressed in speech synthesis either by using emotional speech
databases as concatenation material or by formulating explicit prosody rules on
the link between emotional states and their effects on speech prosody [2]. The
work presented here follows the latter approach.



3.1 Formulation of Emotional Prosody Rules 

Emotional prosody rules were formulated on the basis of a literature review and 
a database analysis. 

A literature review [1] brought about the following results. Unambiguous
agreement exists concerning the link between the activation dimension and the
most frequently measured acoustic parameters: Activation is positively
correlated with mean F0, mean intensity, and, in most cases, with speech rate.
Additional parameters positively correlated with activation are pitch range,
“blaring” timbre, high-frequency energy, late intensity peaks, intensity increase
during a “sense unit”, and the slope of F0 rises between syllable maxima. Higher
activation also corresponds to shorter pauses and shorter inter-pause and
inter-breath stretches.

The evidence for evaluation and power is less stable. There seems to be a 
tendency that studies which take only a small number of acoustic parameters 
into account do not find any acoustic correlates of evaluation and/or power. 

The limited evidence regarding the vocal correlates of power indicates that
power is basically recognised from the same parameter settings as activation
(high tempo, high F0, more high-frequency energy, short or few pauses, large
intensity range, steep F0 slope), except that sometimes, high power is correlated
with lower F0 instead of higher F0, and power is correlated with vowel duration.

There is even less evidence regarding the acoustic correlates of evaluation.
Positive evaluation seems to correspond to a faster speaking rate, less
high-frequency energy, low pitch and large pitch range; a “warm” voice quality;
and longer vowel durations and the absence of intensity increase within a “sense
unit”.

In a statistical analysis of the Belfast Naturalistic Emotion Database [7], per- 
ceptual ratings of the emotion dimensions activation, evaluation and power were 
correlated with acoustic measures (see [1,21] for details). The study replicated the 
basic patterns of correlations between emotion dimensions and acoustic varia- 
bles. It was shown that the acoustic correlates of the activation dimension were 
highly stable, while correlates of evaluation and power were smaller in number 
and magnitude and showed a high variability between male and female speakers. 
In addition, the analysis provided numerical linear regression coefficients which 
were used as a starting point for the formulation of quantified emotion prosody 
rules. 

The effects found in the literature and in the database analysis were formu- 
lated in a way suitable for implementation in a speech synthesis system. 

Table 1 presents the essential data required to express emotions in a speech 
synthesis system using emotion dimensions. The columns represent the emotion 
dimensions, while the rows list all the acoustic parameters for which emotion 
effects are modelled. 

Table 1. Implementable emotion dimension prosody rules. Values on emotion
dimensions range from -100 to 100, with 0 being the “neutral” value. The
percentage values are factors - see text for details.

The numeric data fields represent the linear coefficients quantifying the effect
of the given emotion dimension on the acoustic parameter, i.e. the change from
the neutral default value. As an example, the value 0.5% linking Activation to
rate means that for an activation level of +50, rate increases by 25%, while
for an activation level of -30, rate decreases by 15%.
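The linear rule in this example can be sketched in a few lines. Only the coefficient 0.5 linking activation to rate is taken from the text. Summing the contributions of several dimensions is an assumption made here for illustration, and the function names are hypothetical.

```python
# Coefficient from the text: 0.5% rate change per unit of activation.
RATE_COEFFICIENTS = {"activation": 0.5}


def prosody_change(coefficients, levels):
    """Percent change from the neutral default for one acoustic parameter.

    coefficients -- linear coefficient (in %) per emotion dimension
    levels       -- position on each emotion dimension, from -100 to 100
    Assumption: contributions of the individual dimensions simply add up.
    """
    for level in levels.values():
        if not -100 <= level <= 100:
            raise ValueError("dimension values range from -100 to 100")
    return sum(coefficients.get(dim, 0.0) * level for dim, level in levels.items())


def as_signed_percent(change):
    """Format a change as a signed percentage string, e.g. '+25%'."""
    return f"{change:+.0f}%"
```

Calling `prosody_change(RATE_COEFFICIENTS, {"activation": 50})` gives 25.0, and an activation of -30 gives -15.0, matching the two cases in the example above.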

3.2 Implementation: EmoSpeak 

The German text-to-speech system Mary (Modular Architecture for Research 
on speech sYnthesis [22]) was used as the platform for the implementation of 
the emotional prosody rules specified in Table 1 . This system was most suitable 
for the task because of the high degree of flexibility and control over the various 
processing steps, which arises from the use of the system-internal representation 
language Mary XML. 

A major design feature in the technical realisation of the emotional speech
synthesis system was that the acoustic effects of emotions should be specified in
one single module. This module adds appropriate Mary XML annotations to the
text which are then realised by the respective modules within the Mary system.
As a consequence, all of the parameters are global in the sense that they will be
applied to all enclosed text. This approach is considered the most transparent,
as the link between emotions and their acoustic realisations is not hidden in
various processing components, and the easiest to maintain and adapt, as all
rules are contained in one document.

A simple emotion dimension markup language (EDML) was created, which
annotates text using a single <emotion> tag in which the positions on emotion
dimensions are specified as the values of the activation, evaluation and power
attributes. An example EDML document is the following:

<?xml version="1.0" encoding="ISO-8859-1"?>
<emotion activation="+30" evaluation="+70" power="+20">
Wie wunderbar!
</emotion>
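
As a rough illustration of how such a document could be read by a client
program, the following Python sketch parses an EDML string with the standard
library. The helper `parse_edml` and the choice to treat a missing attribute as
the neutral value 0 are assumptions made for this example; they are not part of
the system described here.

```python
import xml.etree.ElementTree as ET

EDML_EXAMPLE = """<emotion activation="+30" evaluation="+70" power="+20">
Wie wunderbar!
</emotion>"""

DIMENSIONS = ("activation", "evaluation", "power")


def parse_edml(document: str):
    """Return (dimension values, text) for a single-<emotion> EDML document."""
    root = ET.fromstring(document)
    if root.tag != "emotion":
        raise ValueError("expected an <emotion> root element")
    values = {}
    for name in DIMENSIONS:
        # Assumption: a missing attribute means the neutral value 0.
        value = float(root.get(name, "0"))
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must lie between -100 and 100")
        values[name] = value
    return values, (root.text or "").strip()


dimensions, text = parse_edml(EDML_EXAMPLE)
print(dimensions)  # {'activation': 30.0, 'evaluation': 70.0, 'power': 20.0}
print(text)        # Wie wunderbar!
```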

For transforming the emotion markup language into the Mary-internal
MaryXML, the emotion realisation module uses an XSLT stylesheet implementing
the rules listed in Table 1. Applying that XML transformation to the above
example creates the following MaryXML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="0.3" xml:lang="de">
<voice name="de6">
<prosody accent-prominence="-20%" accent-slope="-5%"
    fricative-duration="-6%" liquid-duration="+27%"
    nasal-duration="+27%" number-of-pauses="+21%"
    pause-duration="-6%" pitch="124" pitch-dynamics="-12%"
    plosive-duration="-6%" preferred-accent-shape="alternating"
    preferred-boundary-type="low" range="37" range-dynamics="+4%"
    rate="+29%" volume="60" vowel-duration="+27%">
Wie wunderbar!
</prosody>
</voice>
</maryxml>

Note that the voice is set to de6, one of the two voice databases with three
activation-related voice qualities created in the NECA project [23]. The
attributes of the prosody tag, though listed here in alphabetical order rather than
grouped by their meaning, correspond exactly to the rows of Table 1. As
motivated above, the MaryXML document retains no information about the emotional
state, but all information required for the desired acoustic realisation of that
emotional state. Like any other MaryXML document, the above example can be
processed by the standard Mary modules.

A graphical user interface was programmed to allow for an interactive explo- 
ration of the emotional speech synthesis system. 

The interface, shown in Figure 1, allows the user to type in any text, and to
specify the emotional state with which the text is to be spoken. The position on
the activation and evaluation dimensions is specified simultaneously, by locating
a green cursor in a two-dimensional space modelled after the Feeltrace circle
[24]. The third dimension, power, is set independently. Using these positions
on the three emotion dimensions as well as the text to be spoken, an EDML
emotion markup document is created. The emotion-to-maryxml transformation
stylesheet, described above, transforms this into a MaryXML document, which
is displayed for the interested user in a second window under the EmoSpeak
interface (not shown). Simultaneously, the MaryXML document is sent to the
Mary server, which synthesises a corresponding audio file and sends it back to
the interface. All of these transformations are carried out continuously as the
user modifies the emotional state or the text. By clicking on the “Play” button,
the user can hear the result.

Fig. 1. The EmoSpeak interface to emotional speech synthesis. The system is available
online under http://mary.dfki.de/emotional.html



3.3 System Evaluation 

The appropriateness of the generated emotional prosody and voice quality was
assessed in a perception test. Due to the multimodal nature of any emotional
utterance, this appropriateness is to be thought of in terms of coherence with
other channels expressing the emotion, such as verbal content, possibly visual
channels, and the situational context. For applications, it is necessary for the
speech prosody to “fit with” the rest of the message, in order not to put off the
users. This question seems more important than the question of what emotional
information can be conveyed by the speech prosody alone.

Consequently, any perception test methodology aiming at the assessment of 
the perceived emotional state is suboptimal for the current task. This includes 
the identification task methodology widely used for the evaluation of synthesised 
emotional speech [2], as well as methods for describing the perceived emotional 
state in terms of emotion dimensions, such as semantic differential ratings [17] 
or the Feeltrace tool [24]. Instead, a preference task methodology, using coherent 
and contradictory multi-channel emotional messages as stimuli, was explored as 
a promising alternative. 

In order to create reference material with a known emotional connotation,
36 textual descriptions of emotional situations were extracted from the Belfast
Naturalistic Emotion Database [7], translated to German and presented in a
written rating task. 15 subjects (mostly students; 4 male, 11 female) rated each
situation description on the evaluation and activation dimensions. Eight
situation descriptions with high inter-rater agreement (i.e., small standard deviations
in activation / evaluation ratings) were retained as relatively unambiguous
emotion identifiers. They were well distributed across the activation-evaluation space
and included extreme as well as moderate emotional states, see Fig. 2.¹




Fig. 2. Co-ordinates of selected situation descriptions in activation-evaluation
space. Error bars show one standard deviation in each direction.

Fig. 3. Example of a result plot: Combination of prosody B with all texts.
Pillar height represents goodness-of-fit ratings.



¹ The situation descriptions are summarised as follows. A: matter-of-fact description
of a painting’s history; B: he complains to his ex-wife about wanting to speak to
their son more often; C: he accuses his sister of having lied to him; D: he
remembers how he first saw his future wife; E: he is very relieved about not being
thrown out of his flat today; F: he talks about vain attempts of a psychiatrist to
help his wife; G: he is convinced everybody hates him so he stopped even trying to
speak to people; H: a hiker describes the enjoyable peace in the mountains.

For each of the emotional states defined by the situation descriptions,
emotional speech prosody settings were calculated, and each of the texts was synthesised
with each of the prosodic settings in a factorial design, resulting in 64 stimuli.
In a listening test, 20 subjects rated each stimulus according to the question:
“How well does the sound of the voice fit with the content of the text?”
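
The factorial design can be enumerated in a few lines of Python. The labels A
to H follow the situation descriptions listed above; reusing the same letters
for the prosodic settings, and the variable names, are assumptions made for
this sketch.

```python
from itertools import product

# Eight situation texts and the eight prosodic settings derived from them.
TEXTS = list("ABCDEFGH")
PROSODIES = list("ABCDEFGH")

# Full factorial design: every text combined with every prosodic setting.
stimuli = list(product(TEXTS, PROSODIES))

# Coherent stimuli pair a text with the prosody computed for the same situation;
# all other combinations are, to varying degrees, contradictory.
coherent = [(text, prosody) for text, prosody in stimuli if text == prosody]

print(len(stimuli))   # 64
print(len(coherent))  # 8
```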

The results confirmed the hypothesis that the prosodic configurations succeed
best at conveying the activation dimension. Moreover, the appropriateness of a
prosodic configuration for a given emotional state was shown to depend on the
degree of similarity between the emotional state intended to be expressed by
the prosody and that in the textual situation description: There is a highly
significant negative correlation between the difference in activation between text
and prosody on the one hand and the goodness-of-fit ratings on the other (partial
correlation controlling for difference in evaluation: r = -.331, one-tailed p <
.001). In other words, the more different the activation expressed in the text
was from that expressed through the prosody, the lower the ratings for their
combination (the exemplary result plot in Fig. 3 also illustrates this).

In agreement with previous findings for human speech, the evaluation
dimension was found to be more difficult to convey through the prosody. Only a very
small partial correlation (r = -.079, one-tailed p = .002) was found between
the distance between text and prosody on the evaluation dimension and the
goodness-of-fit ratings.
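
For readers unfamiliar with partial correlation, the following Python sketch
shows the standard first-order formula,
r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)).
The function names and the toy numbers are illustrative assumptions; they are
not the experimental data, and the sketch makes no attempt to reproduce the
reported coefficients or their significance tests.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data (hypothetical): activation distance, fit rating, evaluation distance.
activation_distance = [1, 2, 3, 4]
fit_rating = [-1, -2, -3, -4]
evaluation_distance = [1, -1, -1, 1]

# Ratings fall perfectly with activation distance here, so the result is -1.0.
print(partial_correlation(activation_distance, fit_rating, evaluation_distance))
```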

In summary, the speech synthesis system succeeded in expressing the activa- 
tion dimension (the speaker “arousal”), but not the evaluation dimension. See 
[1] for a more detailed account of the experiment. 

4 Conclusion 

This paper has raised the issue of emotion representation in the context of af- 
fective dialogue systems. Selected available representations were described and 
considered in relation to their suitability for various processing components. The 
possibilities and limitations of mapping between these representations were also 
discussed. 

In more depth, the paper has investigated the principled use of emotion 
dimensions in the context of speech synthesis, this formalism being well suited 
for the description of gradual, non-extreme emotional states. 

A set of rules for mapping emotion dimensions onto prosodic settings was
compiled from the literature and from a database analysis, and implemented in
a German text-to-speech system. A perception test confirmed that the system
succeeded in expressing the speaker activation or arousal. In line with the
literature on human speech, the expression of the evaluation dimension did not
succeed. It is an open question whether this is a fundamental restriction of the
speech channel or whether modelling of additional prosodic parameters, such as
those linked to smiling [25], would improve perception.
Abstract. The Expressive Speech Processing project has been collecting
natural conversational speech from a number of ordinary people as
they go about their daily lives for almost four years now. As a result,
we have a better idea of the types of information that are signalled by
interactive speech, and propose a framework within which the intended
interpretation of an utterance can be specified for dialogue speech
synthesis incorporating affective information. We have found that a very
large proportion of speech utterances simultaneously convey non-lexical
interpersonal and discourse-related information, and propose a model by
which such extra-semantic protocols may be incorporated.



1 Introduction 

Human speech carries many layers of information simultaneously. In addition to 
linguistic information it signals speaker-specific information, such as age, sex, 
personality, health, mood, and emotion, as well as discourse information gover- 
ning role-taking and turn-taking aspects of the dialogue, and affective informa- 
tion that provides clues to the intended interpretation of a given utterance. Each 
layer is perceived independently in the human communication process, but only 
the linguistic layer is adequately modelled in current computer speech processing 
systems. Speech is reduced to text. 

There has long been an implicit assumption in computer speech processing 
that a text transcription adequately embodies all the relevant parts of a spoken 
message, but this is now being challenged. In particular, the affective colouration 
of a message is now being considered as an essential component which must be 
encoded with the text for a successful interpretation of the speaker’s intended 
message. 

This paper proposes a framework for the synthesis of dialogue speech that 
incorporates affective information. It discusses some of the limitations of pre- 
sent computer speech synthesis and presents findings from an analysis of in- 
teractive conversational speech, showing that there is a particular need for the 
processing of non-verbal speech sounds in human-computer interactions. These 
extra-semantic protocols facilitate the exchange of discourse-related information, 
and signal speaker-state, speaker-hearer relationships, and pragmatic cues to the 
intended interpretation of an utterance. 
2 Interactive Speech

The Expressive Speech Processing project [1] is a five-year multi-party research 
effort to improve speech technology for an advanced media society by raising the 
baseline of current speech processing so that the daily speech of ordinary people 
can be included. To enable this technology, we have been collecting natural con- 
versational speech from a number of volunteers as they go about their daily lives 
[2]. Conversational speech data has been collected using high-quality portable 
sound recorders in everyday situations for almost four years. There have been no 
constraints imposed on the content or style of the speech, and every effort has 
been made to obtain natural samples from everyday situations in order to better 
understand the scope and variety of speaking styles in spoken interactions. As a 
result, we now have a better idea of the types of information that are signalled 
by interactive speech. 

In particular, we have found that a very large proportion of speech utteran- 
ces in interactive discourse are used to convey non-lexical, interpersonal, and 
speaker-state information rather than propositional content [3]. Consisting lar- 
gely of ‘grunts’ and interjections, these speech sounds are rarely found in a dic- 
tionary, or incorporated in a formal grammar of the written language. They can 
have many different interpretations depending on the way in which they are spo- 
ken [4]. Accordingly, we propose a framework within which such ‘extra-semantic’ 
speech protocols may be modelled. 

By way of illustration, consider the following hypothetical dialogue between 
a human and a robot, not yet possible but representative of the type perhaps 
being planned for in present human-computer dialogue systems. The first part 
(on the left of figure 1) illustrates the type of semantically-heavy dialogue that 
predominates in systems currently considered by linguists and engineers; the 
second part (on the right of figure 1) is its equivalent, exemplifying the types 
of dialogue encountered in the ESP corpus. Both are stereotypes, but the diffe- 
rence illustrates the brevity of a human-human interaction, which places more 
emphasis on tone-of-voice and prosodic inflections, in contrast to the lexical load 
of constructed dialogues. 



Excuse me                                                           Oi!
Yes, can I help you?                                                Uh?
Errm, I’d like you to come here and take a look at this ...         Here!
Certainly, wait a minute please.                                    Oh ...
Can you see the problem?                                            Okay?
No, I’m afraid I don’t understand what you are trying to show me.   Eh?
But look at this, here ...                                          Look!
Oh yes, I see what you mean!                                        Ah!



Fig. 1. Example dialogues between a human and a cooperative robot - the left part 
illustrates the current (text-based) view; the right part shows its real-life equivalent, 
consisting largely of ‘grunts’, but relying very much on intonation and tone-of-voice 
3 Extra-Semantic Protocols

Table 1 presents some figures from the ESP corpus to support the claim
inherent in figure 1 that shorter dialogues are more natural. From about 150,000
transcribed conversational utterances from one speaker over a period of four
years, we find that almost 50% are non-lexical; i.e., they could not be
understood from a transcription of their text alone. Very few of the non-lexical (or
extra-semantic) terms in these utterances can be found as an entry in a
dictionary. They take their meaning from their prosody and are understood in the
context of the discourse. However, it has been confirmed that the intended
meaning of many of these utterances can be perceived consistently by listeners
even when they are presented in isolation, without discourse context information.
In many cases, the intentions underlying the utterances can be appropriately and
consistently paraphrased even by listeners of completely different cultural and
linguistic backgrounds [4].



Table 1. Counts of non-verbal utterances in the transcriptions for one speaker in 
the ESP corpus. Transcribers are encouraged to break utterances into their smallest 
components (one-per-line) while maintaining sense groups intact. Utterances labelled 
‘non-lexical’ consist mainly of sound sequences and combinations not found in the
dictionary, but may include common words such as “yeah”, “oh”, “uhuh”, etc.



number of utterances transcribed          148772
number of unique lexical utterances        75242
number of non-lexical utterances           73480
number of non-lexical utterance types       4492
proportion of non-lexical utterances       49.4%
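
The proportion in the last row follows directly from the counts. A small Python
check, using only the figures given in Table 1 (the variable names are chosen
for this sketch):

```python
# Counts taken from Table 1.
total_utterances = 148772
non_lexical_utterances = 73480
non_lexical_types = 4492

# Share of all transcribed utterances that are non-lexical.
proportion = 100 * non_lexical_utterances / total_utterances
print(round(proportion, 1))  # 49.4

# Average number of occurrences per non-lexical utterance type.
tokens_per_type = non_lexical_utterances / non_lexical_types
print(round(tokens_per_type, 1))  # 16.4
```

On these figures each non-lexical type recurs about sixteen times on average,
which is consistent with the observation that a small inventory of such sounds
is reused heavily.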



It is clear from Table 1 that speakers and listeners must share a protocol 
for the communication of affective information that can be interpreted in place 
of, or in line with, the more well-formed semantic utterances that are produced 
for the communication of propositional content. That the listener can interpret 
a grunt (for lack of a better term - since the non-lexical utterances are not all 
well described as interjections) in ways that the speaker apparently intended im- 
plies that the current assumption of communication being signalled by semantic 
elements functioning within a syntactic framework (as portrayed by a linguistic 
transcription of the speech content) is inadequate for spoken language. Yet all 
speech technology and language processing systems are still based largely upon 
a textual representation of the speech. 

Linguistic representations do not sufficiently represent the paralinguistic
information in speech any more than they can represent the extra-linguistic
information such as voice and personality of the speaker. A textual representation of a
spoken utterance is therefore a poor indicator of both its content and its
functional intent. It appears from Table 1 that in human interactions, almost half of the
utterances are used for displaying affect rather than propositional content. For
prescribing speech in a dialogue system, we therefore need a multi-dimensional
representation instead.

This paper suggests that this representation should consist of at least three
levels: one to represent the state of the speaker, one to represent the relationships
between the speaker and the listener at the time of the discourse, and one to
represent the pragmatic force of the utterance in the context of the dialogue. In
previous work, we have proposed an SOE framework (for Self, Other, and Event)
to describe these three levels from the point of view of utterance description.
Here, we extend that framework to show how it can be used to specify affective
information for synthesis in a dialogue system.



4 Speech Synthesis 

Computer speech synthesis can already be considered as a mature techno- 
logy. Many applications announcing e.g., weather-forecasts, share-prices, train- 
timetables, car-navigation directions, etc., regularly employ speech synthesis to 
convey information stored in a computer database. The quality of computer 
speech has now reached a level at which it can be used in human-computer 
dialogues (such as customer-care applications) and even in human-human dia- 
logues (such as speech translation applications) with the output speech being 
sometimes imperceptibly different from that of a human, and with recognisable 
personality characteristics. 

The early generations of speech synthesisers employed algorithms to predict 
the acoustic characteristics of the speech waveforms, and succeeded in mimicking 
the phonetic properties of speech to the extent that the message was completely 
intelligible, although not necessarily recognisable as a human voice. Later gene- 
rations employed recordings of actual speech signals as the source for the output 
waveforms, concatenating small (typically phone- or diphone-sized) segments 
and modifying their prosody to match the requirements of the desired output 
speech. Because of damage caused by the signal processing, the naturalness of 
the speech was reduced, although its intelligibility was improved. More recently, 
speech synthesis systems have made use of very large corpora of actual speech, 
selecting segments for concatenation that embody both the phonetic and the 
prosodic characteristics required. In this way, the original speaker characteri- 
stics are preserved and the speech is not just meaningful but also recognisable 
as an identifiable human voice. However, most such systems are currently still 
limited to a single speaking style, as they typically use studio recordings of ca- 
refully read speech, often from trained professional speakers, as the source of 
their waveforms. While adequate for announcements, these source-speech data- 
bases include little of the variation encountered in conversational speech, and 
synthesisers using them are not yet capable of reproducing expressive conversa- 
tional speaking styles. 

We anticipate that future generations of speech synthesisers will continue to
extend the trend towards improved realism and will evolve to incorporate the
expressivity of interactive human speech. To do this, they will need to control
the tone-of-voice as well as the prosody of an utterance [5] and to be capable
of specifying the variation in speaking-style parameters that is required for the
communication and disambiguation of not just syntactic and semantic content,
but also extra-semantic affective information.

By using very large corpora of natural speech as a source for the waveform
segments, we can conceive that it might be possible for a synthesiser to express
the same types of information and in the same ways that a human speaker does
in normal everyday communication. However, there is not yet a clear protocol for
describing the extra-semantic information necessary for such affective speech
synthesis.



5 Input Requirements for Synthesis 

For the specification of a dialogue utterance for synthesis, we first need to
determine whether that utterance is intended primarily to convey information
(I-type) or to display affect (A-type). The former type of utterance can probably
be adequately specified by its textual representation alone. The latter is more
text-independent, and to predict how it should be realised, we need to know
about the speaker-listener relationships (both short-term and long-term), the
speaker-state (with respect to (a) emotions, mood, health, and state-of-mind,
and (b) current interest and involvement in the dialogue), and the
intended effect or pragmatic force of the utterance. Note that ‘emotion’, which is
a commonly-used term in the current speech-technology literature, is relegated
to a subcategory rather than a dimension in its own right.

An utterance whose primary function is to display affect will be either of a
non-lexical type (short repeated monosyllables, such as “yeah-yeah-yeah-yeah-yeah”,
or “uhuh, uhuhuh”) or a common phrase which is used phatically,
not for its lexical meaning but for display of speaker-listener relations (like “Nice
day, isn’t it!” or “Good morning”, “Sleep well?”, etc.). In order to synthesise these
A-type utterances, we need to know who is talking to whom, where, and why.
This information can be coded in higher-level terms as a combination of the
following three features or ‘SOE’ constraints: (i) Self, (ii) Other, (iii) Event, as
in (1), which defines an utterance (U) as specified by the pair Self (S) and Other
(O) given Event (E):



$U = (S, O) \mid E$    (1)

where the feature Self can take different values (representing strong and weak
settings with respect to the dimensions mood and interest respectively), the
feature Other can also take different values (representing strong and weak settings
with respect to the dimensions friend and friendly respectively), and the feature
Event represents a speech act (in a wider and more detailed sense than Searle [6]
defined) or a discourse move.
The feature Self refers to (a) the state of the speaker and (b) his or her
interest in the content of the utterance. For example, a healthy, happy person
is likely to speak more actively than an unhealthy or miserable one. One who is
interested in the topic or highly motivated by the discourse is likely to be more
expressive than otherwise.

The feature Other refers to (a) the relationships between speaker and hearer, 
and (b) the constraints imposed by the discourse context. A speaker talking with 
a friend is likely to be more relaxed than when talking with a stranger, but will 
also probably be more relaxed when talking informally, e.g., in a pub, than when 
talking formally, e.g., in a lecture hall. 

For ease of implementation in speech synthesis, both the Self and Other
features can be simplified to a scale of four values each (e.g., plus/minus friend &
friendly), but the Event feature remains open-ended and depends on the number
of categories used in any particular dialogue system. For the ESP transcriptions,
we have determined the list in the Appendix to be optimal. Each utterance is first
categorised in terms of its directionality, then in terms of modality, i.e., whether
primarily of I-type or of A-type, as in Table 2, and then for affective subcategory
if relevant. All A-type utterances, whether lexical or grunts, are candidates for
an affect label.
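
One possible encoding of such a specification is sketched below in Python. The
class and field names are hypothetical, and representing each four-valued
scale as a pair of plus/minus flags is our reading of the “plus/minus friend &
friendly” example rather than a definition given in the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class SelfState:
    mood: bool      # True = strong (plus) setting, False = weak (minus) setting
    interest: bool


@dataclass(frozen=True)
class OtherState:
    friend: bool    # is the listener a friend?
    friendly: bool  # is the setting friendly/informal?


@dataclass(frozen=True)
class UtteranceSpec:
    """U = (S, O) | E: Self and Other settings, given an Event label."""
    self_state: SelfState
    other: OtherState
    event: str                    # open-ended, e.g. a dialogue act label
    affect: Optional[str] = None  # affect subcategory, if relevant


# Two plus/minus flags per feature give the four values on each scale.
ALL_SELF = [SelfState(m, i) for m, i in product((False, True), repeat=2)]
ALL_OTHER = [OtherState(f, fr) for f, fr in product((False, True), repeat=2)]

spec = UtteranceSpec(SelfState(True, True), OtherState(True, False),
                     event="Greetings")
print(len(ALL_SELF), len(ALL_OTHER))  # 4 4
```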

Table 2. Basic utterance types for the Event category (see Appendix for a full list) 





          seeking         offering
I-type    interrogative   declarative
A-type    back-channel    exclamative
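
Table 2 amounts to a two-by-two lookup from modality and directionality to a
basic utterance type. A minimal Python sketch, in which the dictionary layout
and the function name `basic_type` are assumptions made for illustration:

```python
# (modality, directionality) -> basic utterance type, as laid out in Table 2.
BASIC_TYPES = {
    ("I-type", "seeking"): "interrogative",
    ("I-type", "offering"): "declarative",
    ("A-type", "seeking"): "back-channel",
    ("A-type", "offering"): "exclamative",
}


def basic_type(modality: str, directionality: str) -> str:
    """Return the basic utterance type for a modality/directionality pair."""
    try:
        return BASIC_TYPES[(modality, directionality)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {modality!r}, {directionality!r}"
        ) from None


print(basic_type("A-type", "offering"))  # exclamative
```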



6 Discussion 

Concatenative speech synthesis using the ESP corpus as a source of speech units
offers a new view of the challenges of speech synthesis. Rather than having to
search for an optimal sequence of segments to make up a word, we find that
most words (and indeed many phrases) can be found not just whole in the
corpus, but also in many different contexts and speaking styles. The task
is now to select, from the many likely candidates, the one that best matches the SOE
constraints so that an utterance appropriate to the situation can be generated
(in this case, replayed, rather than synthesised). Since we already have four
years’ worth of conversational speech from this speaker, it is highly likely that
any A-type utterance that she will need in the fifth year can be found at least
once among the existing data. This is probably true also for many of the I-type
utterances, but we note that many of the unique utterances are particularly
location- or date-specific.
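
The selection step might look roughly as follows in Python. The paper does not
specify a matching procedure, so the scoring used here (exact match on Event,
then a count of agreeing Self and Other flags) is purely an assumption, as are
all names and the three-item toy corpus.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Flags = Tuple[bool, bool]


@dataclass(frozen=True)
class CorpusUtterance:
    text: str
    self_state: Flags  # (mood, interest)
    other: Flags       # (friend, friendly)
    event: str


def soe_score(utt: CorpusUtterance, self_state: Flags, other: Flags) -> int:
    """Number of Self/Other flags on which the candidate matches the target."""
    have = utt.self_state + utt.other
    want = self_state + other
    return sum(a == b for a, b in zip(have, want))


def select_utterance(corpus: Sequence[CorpusUtterance], self_state: Flags,
                     other: Flags, event: str) -> Optional[CorpusUtterance]:
    """Pick the stored utterance with the required Event and the best S/O match."""
    candidates = [u for u in corpus if u.event == event]
    if not candidates:
        return None
    return max(candidates, key=lambda u: soe_score(u, self_state, other))


# Hypothetical toy corpus; real entries would point at recorded waveforms.
corpus = [
    CorpusUtterance("Good morning", (True, True), (False, False), "Greetings"),
    CorpusUtterance("Morning!", (True, True), (True, True), "Greetings"),
    CorpusUtterance("Eh?", (False, True), (True, True), "Repetition Request"),
]

best = select_utterance(corpus, (True, True), (True, True), "Greetings")
print(best.text)  # Morning!
```

Because whole utterances are retrieved, the lexical content of the result may
differ from any text the caller had in mind; only the Event and the Self and
Other settings are constrained.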

We are currently testing the concept of affective speech synthesis for a
daily-life conversational dialogue system. Since the utterances can be replayed intact
from the speech corpus, the focus of evaluation for the ‘synthesis’ is no longer
on naturalness or intelligibility, but rather on ‘appropriateness’. An interesting
corollary which is still being evaluated is that lexical specification can be loosened
in such a synthesis system. An expression of affect can be realised in many
equivalent ways and the exact lexical terms may in fact be almost irrelevant so
long as the intended display is realised and the SOE constraints met.

7 Conclusion 

This paper has presented a framework for the description of speech suitable for 
input in an affective dialogue speech synthesiser. It has described a classification 
of utterances into two types: I-type being primarily information-bearing, and
A-type serving primarily to express affect. It has further described a set of SOE
constraints that govern the circumstances under which the realisation of an 
utterance may vary. Based on our analysis of the ESP corpus, we believe that 
these descriptors provide an optimal framework for the classification of affect in 
speech utterances, and we are planning further work to determine the extent to 
which utterances can be substituted within the above constraints without loss of 
impact or change of pragmatic effect. This work is a step towards the definition 
of protocols for the transmission of extra-semantic information in speech. 
Appendix: Dialogue Act Labels Used in the ESP Corpus



Directionality   Category           Dialogue Act         (Response* subcats)

                 Questions          Question
                                    Y/N Question         agree
                                    Repetition Request   understand
                                    Request              convinced
                 Opinions           Opinion              accept
                                    Compliment           interested
                                    Desire               not convinced
                                    Will                 uncertain
                                    Thanks               negative
                                    Apology              repeat
                 Negative Opinion   Objection            self-convinced
                                    Complaint            notice
                 Advice             Advice               thinking
(offering)                          Command              unexpected
(or)                                Suggestion           surprise
(seeking)                           Offer                doubt
                                    Inducement           impressed
                 Information        Give Information     sympathy
                                    Reading              compassion
                                    Introduce Self       other
                                    Introduce Topic      exclamation
                                    Closing              listening
                 Greetings          Greetings
                                    Talking to Self
                                    Asking Self
                                    Checking Self
                 Other              Notice
                                    Laugh
                                    Filler
                                    Disfluency
                                    Mimic
                                    Habit
                                    Response*
                                    Backchannel*





Note 1: All utterances are first categorised according to directionality which 
may be either offering (to the listener) or seeking (from the listener). 



Note 2: Responses and backchannel utterances are further subcategorised to 
show the affect type (using the labels listed in the last column of the table) 
Abstract. Laughter is a powerful means of emotion expression which has not
yet been used in speech synthesis. The current paper reports on a pilot study in
which differently created types of laughter were combined with synthetic
speech in a dialogical situation. A perception test assessed the effect on
perceived social bonding as well as the appropriateness of the laughter. Results
indicate that it is crucial to carefully model the intensity of the laughter, whereas
speaker identity and generation method appear less important.



1 Introduction 

This paper describes an explorative study of how laughter can be modelled in speech
synthesis. In the past, the integration of emotional components in synthetic speech has
concentrated on changing the tone of voice of the generated speech in order to
elicit distinguishable emotions or emotional nuances [1]. Another means of vocal
emotion expression, which has been neglected in speech synthesis so far, is non-speech
vocalisations such as laughter, which have a high communicative and emotional
character [2].

One basic problem is to predict when to add laughter in synthetic speech (every- 
body knows examples of laughter in inappropriate situations). Another basic problem 
is how to add laughter in synthetic speech. This requires some knowledge of types of 
laughter and how they are used in human-human communication. This also requires 
some knowledge about the acoustic structure of different laughter types and some 
ideas about "laughter synthesis". 

Obvious problems are: can we copy-synthesise laughter with existing speech mate- 
rial or do we need special recordings? Where do we insert laughter with respect to 
syntactic and/or prosodic sentence structure? What type of laughter should be in- 
serted? How do we verify that the synthetic speech with added laughter triggers an 
appropriate perception for the listener? 



2 Laughter in Human Interactions 

2.1 Forms of Laughter 

Although voiced, loud and long laughter is probably the dominant prototypical form
when we think of laughter, most laughs in real dialogues are mild forms which are
often ignored or just "overheard", even by dialogue annotators [3]. Thus, shorter and
less intensive laughs seem more appropriate and more realistic for a convincing
dialogue. This is in line with [1], who states that emotional nuances rather than
"full-blown" emotions are required in a dialogue, be it human-machine, human-human or
machine-machine [4], multi-modal or speech-only (e.g. over a telephone line).

Human laughter shows a great repertoire [5, 6]. Apart from fully voiced laughter, 
other types of laughter are unvoiced with an oral constriction or nasal turbulences, 
which is also mirrored in words such as "cackle", "giggle" and "chuckle". Forms of
laughter which are distinct from the ones just listed are those which occur simultaneously
with speech, so-called speech-laughs, which can occur more often in dialogues than
autonomous laughs [7]. Speech-laughs are distinct from smiled speech, which can
audibly occur more or less intensely over a longer or shorter period of speech.

2.2 Laughter in Dialogues 

Laughter usually occurs as a spontaneous action in a social context in conversations. 
Despite the infectious effect of laughter which can be considered as a listener reaction, 
especially in a group of more than two people, many if not most laughs come from the 
speaker [8]. Ultimately, if one wanted to model laughter in dialogues as naturally as 
possible, it would be necessary to consider that speech can occur not just as "neutral" 
speech, but also as smiled speech and with short speech-laughs. In addition, it is not 
uncommon for laughter to be realised as joint laughter of both dialogue partners. 

2.3 Functions of Laughter 

Although laughter and humour are often mentioned in the same breath, laughing is not 
only an expression of exhilaration and amusement, but is also used to mark irony or a 
malicious intention [2]. An important function of laughter is social bonding. It occurs 
in the interactions between mothers and their newborns as well as between adults [8, 
9], especially when the interacting partners are unknown to each other [7]. Social 
bonding is also the function considered in this study for interaction with a computer. 
This is, however, not a straightforward task because "social laughter" is highly deter- 
mined by cultural norms which depend on factors such as age, sex, and social status of 
participants and their relationship [9]. 



3 Pilot Study 

Research question. For the present pilot study, we targeted a simple autonomous 
laugh with a length of two syllables and a mild intensity. We are interested in two 
questions: First, does the integration of a laugh in synthetic speech in a dialogue situa- 
tion lead to the perception of higher social bonding for the listener? Second, is the 
effect achieved appropriately? 

Speech material. Two dialogue extracts were designed as ends of appointment mak- 
ing conversations. The translations of the two extracts are: 
A: Shall we do it this way?

B: Okay <Laugh> Then we see each other on Monday. 

A: On Friday, I am only free after twelve. The best thing will be if we 
meet at one on Friday. 

B: All right. <Laugh> That should be fine with me. 

Speech Synthesis. We use the German diphone speech synthesiser Mary [10] with 
the Mbrola voices "de6" (male) for speaker A, and "de7" (female) for speaker B, 
which provide full German diphone sets in “soft”, “modal” and “loud” voice qualities 
[11]. As a baseline, the speech without any laughter was generated with Mary and 
prosodically "fine-tuned". This baseline version was changed by inserting a laugh after 
the first phrase of speaker B. 

The laugh was generated in six different ways*. Versions 1, 2, and 3 considered the
laugh as quasi-speech ("hehe"): the duration and F0 values of natural laughter were
superimposed onto "modal", "loud" and "soft" diphones taken from the voice “de7”.
Versions 4 and 5 were unprocessed recordings of natural laughter produced by the
speaker of the “de7” database, with different degrees of intensity. The last laugh, a
very mild one, was taken from a natural speech corpus with a different female voice
(version 6).
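The quasi-speech versions can be pictured as a phoneme list with durations and F0 targets, which is the kind of input an Mbrola voice reads. The following is only a minimal sketch: the phoneme symbols, durations and F0 values are hypothetical placeholders, not the values measured from natural laughter in this study, and the helper name `laugh_to_pho` is introduced here purely for illustration.

```python
# Sketch: write a two-syllable "hehe" laugh as Mbrola-style phoneme lines.
# Each line is: phoneme, duration in ms, then optional (position %, F0 Hz) pairs.
# All symbols and numbers below are hypothetical placeholders.

def laugh_to_pho(segments):
    """Turn (phoneme, duration_ms, f0_targets) tuples into phoneme-file text."""
    lines = ["; hypothetical two-syllable laugh, placeholder values"]
    for phoneme, duration_ms, f0_targets in segments:
        targets = " ".join(f"{pos} {hz}" for pos, hz in f0_targets)
        # Segments without F0 targets (e.g. aspiration) get no trailing space.
        lines.append(f"{phoneme} {duration_ms} {targets}".rstrip())
    return "\n".join(lines)

LAUGH = [
    ("h", 80, []),                       # aspiration, no pitch target
    ("E", 110, [(0, 260), (100, 230)]),  # first vowel, falling F0
    ("h", 70, []),
    ("E", 130, [(0, 240), (100, 190)]),  # second vowel, lower and longer
    ("_", 200, []),                      # trailing silence
]

PHO_TEXT = laugh_to_pho(LAUGH)
print(PHO_TEXT)
```

Keeping durations and F0 targets in one list makes it easy to swap in values copied from a natural laugh while leaving the diphone voice unchanged.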

Perception test. 14 German speaking subjects listened to the audio files in randomised
order, in a quiet office via PC loudspeakers. After each stimulus the following
question had to be answered: “How well do both speakers like each other?” (a 6-point
scale between “very well” and “not at all”). After this first round, all stimuli with the
exception of the no-laughter version were presented again, but this time with the
question: “How well does the laughter fit into the dialogue?” (same 6-point scale).

Results. The results (see Table 1) show that well-selected laughter can indeed increase
the perceived social bonding between speakers. Laughter synthesised from
diphones, however, was ineffective in this respect, as was intense laughter. Most
effective was the very soft laugh from the “wrong” speaker. This softest natural version
was also rated most appropriate, followed by the medium natural version. Diphone-based
laughter was considered slightly inappropriate whereas intense natural laughter
was clearly rated as completely inappropriate.
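As a small illustration of how the ratings in Table 1 can be read, the sketch below subtracts the no-laugh baseline from each version's mean "like each other" rating. The version keys `v1` to `v6` and the variable names are chosen here for illustration only; the numbers are the means reported in Table 1.

```python
# Mean "How well do both speakers like each other?" ratings from Table 1
# (6 = very well, 1 = not at all). Keys v1..v6 are the six laugh versions.
BONDING = {"v1": 3.5, "v2": 3.8, "v3": 3.7, "v4": 3.9, "v5": 4.2, "v6": 4.7}
APPROPRIATENESS = {"v1": 2.9, "v2": 2.4, "v3": 2.5, "v4": 1.0, "v5": 4.7, "v6": 5.4}
BASELINE = 3.7  # no-laugh version

# Gain in perceived social bonding relative to the baseline.
GAINS = {version: round(score - BASELINE, 1) for version, score in BONDING.items()}

# Version with the largest gain, and the one rated most appropriate.
BEST_BONDING = max(GAINS, key=GAINS.get)
MOST_APPROPRIATE = max(APPROPRIATENESS, key=APPROPRIATENESS.get)

for version in sorted(GAINS):
    print(version, GAINS[version], APPROPRIATENESS[version])
```

The mild natural laugh (version 6) comes out on top on both measures, which is the pattern the Results paragraph describes.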



Table 1. Evaluation of different versions of laughter, and of the no-laugh baseline. Responses
from 6 (very well) to 1 (not at all)

                     diphone-based           same speaker        other sp.   baseline
version              1       2       3       4         5         6
question             modal   soft    loud    intense   medium    mild        no
like each other?     3.5     3.8     3.7     3.9       4.2       4.7         3.7
appropriate?         2.9     2.4     2.5     1.0       4.7       5.4         -



* Stimuli and spectrograms of laughs can be found at http://www.phonetik-buero.de/laughter/ 
4 Discussion and Summary

The results of the present pilot study give first indications that laughter can be
added to synthetic speech, and of how this can be done, so that listeners perceive higher
social bonding. The results also suggest that an inappropriate type or intensity of the
laugh can destroy the desired effect in this socially sensitive area.

Normally, a mixing of generation methods (here: entire laugh recordings mixed 
with diphones) would lead to worse acceptance by listeners [12]. However, our find- 
ings hint that these factors may be less important than a careful control of the laugh 
intensity. Surprisingly, even the use of a different voice did not counter this effect. 

If we want to integrate more laughter types (e.g. for situations with irony, which 
was deliberately excluded here), we have to predict forms of laughter which are ap- 
propriately scaled in intensity. This may be true for other affect bursts as well [2]. 
Clearly more basic research on the phonetics and appearance of laughter is needed. 
Abstract. The aim of the research is the phonetic-articulatory description of
emotive speech, achievable by studying the labial movements, which are the
product of compliance with both the phonetic-phonological constraints and
the lip configuration required for the visual encoding of emotions. In this
research we analyse the interaction between labial configurations, peculiar to
six emotions (anger, disgust, joy, fear, surprise and sadness), and the
articulatory lip movements defined by phonetic-phonological rules, specific to
the vowel /’a/ and consonants /b/ and /v/.



1 Introduction 

In this research we aim to define how the speech articulatory parameters are modified 
by the lip configuration of emotions. First of all we asked ourselves if emotions 
modify significantly the labial configuration of /’a/, /b/ and /v/ targets. Secondly we 
investigated how articulatory parameters are modified by emotions and finally we 
tried to quantify the lip movements emotive range with respect to the linguistic- 
articulatory targets, see [4,5]. 



2 Method 

In order to collect the articulatory and acoustic data, an automatic optotracking
movement analyser for 3D kinematic data acquisition (ELITE) was used, which also
allows a synchronous recording of the acoustic signal (for previous applications of
this data acquisition system to the definition of Italian visemes on an articulatory
basis, see [1, 2, 3, 4]).

This system tracks the infrared light reflected by small (2 mm diameter), passive 
markers glued on different points of the external lips contour and of the face, 
following the scheme in Fig. 1 . Only the data relative to lip markers are presented in 
this work. 

A male university student pronounced two phonological ‘VCV’ structures¹,
corresponding to two feminine proper names, “Aba” /’aba/ and “Ava” /’ava/,
simulating, on the basis of appropriate scenarios, six emotional states: anger (A),
disgust (D), joy (J), fear (F), surprise (Su) and sadness (Sa), and the neutral production
(N), corresponding to a declarative sentence². In this research we present the analysis
of the mean values characterizing, for all the articulatory parameters, the targets of the
vowel /’a/ and the consonants /b/ and /v/, extracted with respect to the mid point of
the corresponding acoustic signal.

Fig. 1. Position of the 28 reflecting markers and of the reference planes for the articulatory
movement data collection and time evolution of some kinematic parameters associated with the
sequence /’aba/ expressing disgust.

At the same time as the ELITE recordings we made a video recording of the speaker, and
from this we extracted the following images (see Fig. 2) related to the emotive
production taken at the mid point of /’a/, /b/ and /v/.

1) /’a/   2) /b/   3) /v/

Fig. 2. Some examples of emotive labial configurations taken at the mid point of each
articulatory phonetic target.

¹ Concerning the phonological structure of the selected words, we chose the labial voiced stop
/b/ and the labiodental voiced fricative /v/, as they belong to the 2 classes of consonants for
which the lips are the active articulators and execute phonologically specified movements,
while the central open vowel /a/ offers the widest lip opening and is characterized by the
largest articulatory space.

² Due to the complexity of the movement tracking algorithm, some acquisitions proved
unreliable. For this reason, 7 recordings were used for A, 5 for D, 9 for N, 10 for J, 6 for F,
10 for Su, 7 for Sa in the word /’aba/, and 7 recordings were used for A, 6 for D, 11 for N, 8 for
J, 6 for F, 9 for Su, 6 for Sa in the word /’ava/. All the 107 recordings considered
for the articulatory and acoustic analysis were judged by 3 listeners and scored between 4
and 5 on a 0-5 adequacy scale.

E. André et al. (Eds.): ADS 2004, LNAI 3068, pp. 233-239, 2004.

© Springer-Verlag Berlin Heidelberg 2004
3 Articulatory Results

3.1 First Normalization Data Analysis 

Are the emotive realizations of the /’a/, /b/ and /v/ targets significantly different from
each other? To answer this first question, for each acquisition session, the articulatory
data at resting position were recorded as well, and the extracted parameters were
normalized with respect to these values. With this procedure we obtain data that are
independent of the lip shape, see [4].

The parameters selected to quantify the labial configuration modifications are the 
following: 

• Lip Opening (LO): positive values correspond to the lip opening, negative values to 
lip closure. This parameter correlates with the HIGH-LOW phonetic dimension. 

• Upper and Lower Lip vertical displacements (UL and LL): positive values 
correspond to a reduction of the displacement of the markers from the Q plane 
while negative values stand for a higher displacement. 

• Lip Rounding (LR): corresponding to the ROUNDED-UNROUNDED phonetic 
dimension; negative values correspond to lip spreading, positive values to lip 
rounding. 

• Anterior/posterior movements (Protrusion) of Upper Lip and Lower Lip (ULP and 
LLP): these parameters correlate with the feature PROTRUDED-RETRACTED, 
negative values quantify the lip retraction, positive values correspond to lip 
rounding. 

• Left and Right Corner horizontal displacements (LCX and RCX), calculated as the 
distance between the markers placed on the left and the right lip corner and the 
sagittal Z plane passing through the nose tip and perpendicular to the Q plane. 
Positive values correspond to a higher distance from the plane Z, negative values to 
a lowering of distance from the plane Z. 

• Left and Right Corner vertical displacements (LCY and RCY), calculated as the 
distance between the markers placed on the left and right lip corner and the 
transversal plane Q, containing the line crossing the markers placed on the lobes of 
the ears and on the nose. Positive values correspond to a higher distance from the 
plane Q, negative values to a lowering of distance. 

On the basis of these quantifications it is possible to visually characterize the
emotions and the N production. A 2-factor ANOVA performed on these parameters (see
Appendix, Tab. 1) verifies that some labial parameters vary significantly not only for
emotions but for targets too: LO, UL and LLP values are distinguished by the
consonants. With respect to the other emotions and the N production, D and J are both
characterized particularly by the lowest LO and LL displacement values and by the
highest spreading and retraction values, see [4].

3.2 Second Normalization Data Analysis 

What is the variability of the articulatory parameters determined by emotions? To
answer the second question, for each recording, the extracted articulatory parameter
values, already normalized to the resting position, were subsequently normalized
with respect to the corresponding neutral production data. Since in the neutral
production vowel and consonants present positive and negative values (see [4]), the
values from the second normalization procedure reflect only the emotion effects. The
phonetic meanings of these parameter values are the following:

• Lip Opening (LO): positive values correspond to an increase of lip opening if 
compared to the neutral production; negative values correspond to an increase of 
lip reduction. 

• Upper and Lower Lip vertical displacements (UL and LL): positive values 
correspond to a reduction of the displacement from the plane Q, corresponding to a 
lip rising with respect to neutral, while negative values stand for a lip lowering. 

• Lip Rounding (LR): negative values correspond to higher lip spreading if compared 
to the neutral production; positive values correspond to higher lip rounding if 
compared to the neutral production. 

• Anterior/posterior movements (Protrusion) of Upper Lip and Lower Lip (ULP and 
LLP): negative values quantify the growth of lip retraction with respect to the 
neutral production; positive values correspond to the growth of lip rounding. 

• Left and Right Corner horizontal displacements (LCX and RCX), positive values 
correspond to a higher distance from the neutral; negative values correspond to a 
lowering of distance. 

• Left and Right Corner vertical displacements (LCY and RCY), positive values 
correspond to a lip corner lowering with respect to the neutral production; negative 
values correspond to a lip corner rising. 
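The two normalization steps described in Sections 3.1 and 3.2 can be sketched as follows. This is only an illustration under a stated assumption: the text says the values are "normalized with respect to" the resting position and the neutral production, and the sketch assumes this means a simple subtraction in mm. The function name `normalize_twice` and the example numbers are hypothetical.

```python
# Sketch of the two-step normalization, assuming "normalized with respect to"
# means subtraction (an assumption, not stated explicitly in the text).

def normalize_twice(raw, rest, neutral_raw):
    """Return (first, second) normalization dictionaries, one value per parameter.

    raw         -- parameter values (mm) of an emotive production
    rest        -- values at resting position in the same acquisition session
    neutral_raw -- values of the neutral (N) production of the same target
    """
    # First normalization: remove the speaker's resting lip shape.
    first = {p: raw[p] - rest[p] for p in raw}
    neutral_first = {p: neutral_raw[p] - rest[p] for p in raw}
    # Second normalization: remove the linguistic (neutral) contribution,
    # leaving only the effect attributed to the emotion.
    second = {p: first[p] - neutral_first[p] for p in raw}
    return first, second

# Hypothetical values for two parameters, in mm.
FIRST, SECOND = normalize_twice(
    raw={"LO": 12.0, "LR": -3.0},
    rest={"LO": 2.0, "LR": 1.0},
    neutral_raw={"LO": 8.0, "LR": 0.5},
)
print(FIRST, SECOND)
```

Under this reading, a positive second-normalization LO means the lips open more than in the neutral production of the same target, matching the sign conventions listed above.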

The second normalization data were clustered with the Average Linkage Method, see
[5]. Appendix, Tab. 2 reports the mean values of the second normalization parameters
for emotions and targets. Inspection of the mean data reveals some trends common
to the three articulatory targets, grouping in particular J and D with respect to all the
other emotions. In J, /’a/, /b/ and /v/ have the lowest negative values of LCY and RCY,
corresponding to a vertical rising of the lip corners typical of the smile labial
configuration. This vertical rising is superimposed on the linguistic articulatory lip
configurations. J has negative values of LO and LR and attendant negative values of
UL, LL, ULP and LLP. All these values quantify a labial configuration that is spread,
retracted and with raised lip corners. D presents in /’a/, /b/ and /v/ the lowest negative
value of LO, with consequent negative values of UL and LL, and the lowest negative
values of LR, with the reduction of ULP and LLP, corresponding to the highest lip
retraction and the highest values of LCX and RCX. As a result the labial configuration of
D is characterized by the lowest opening, highest spreading and highest retraction with
respect to the N production.
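To illustrate the grouping of J and D, the sketch below runs average-linkage agglomerative clustering on the mean values reported for /’a/ in Appendix, Tab. 2. It rests on several assumptions: the text does not state which distance was used or whether individual recordings rather than means were clustered, so Euclidean distance on the ten-parameter mean vectors is used here purely for illustration. The function and variable names are introduced for this example.

```python
from itertools import combinations
from math import dist

# Mean second-normalization values (mm) for /'a/, copied from Appendix, Tab. 2.
EMOTIONS = ["A", "D", "J", "F", "Su", "Sa"]
ROWS = {
    "UL":  [0.4, 4.2, 2.2, -1.2, -0.5, -0.8],
    "LL":  [-4.9, 6.9, 4.5, -7.7, 4.0, -0.6],
    "LO":  [5.9, -3.8, -3.0, 9.2, -0.03, 0.2],
    "ULP": [1.4, -4.1, -2.9, 2.8, 1.1, 2.3],
    "LLP": [1.8, -2.8, -2.2, 0.5, 0.1, 1.4],
    "LR":  [0.7, -7.9, -7.8, 2.7, 0.3, 1.4],
    "LCX": [-0.7, 3.5, 4.2, -1.6, -0.4, -0.8],
    "RCX": [0.07, 4.2, 3.5, -1.1, -0.03, -0.6],
    "LCY": [1.3, -0.8, -8.3, 4.0, 1.1, 1.7],
    "RCY": [0.7, -2.4, -7.2, 3.1, 0.1, 1.3],
}
# One ten-dimensional vector per emotion.
POINTS = {e: [ROWS[p][i] for p in ROWS] for i, e in enumerate(EMOTIONS)}

def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the mean Euclidean distance
    over all pairs of members (average linkage).
    """
    clusters = [frozenset([name]) for name in points]

    def linkage(c1, c2):
        total = sum(dist(points[a], points[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return set(clusters)

TWO_CLUSTERS = average_linkage(POINTS, 2)
print(TWO_CLUSTERS)
```

With these assumptions, cutting the tree at two clusters separates disgust and joy from the other four emotions, in line with the trend described above.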



3.3 Emotive and Articulatory Space Interaction 

How do articulatory parameters vary through the different articulatory targets in the
realization of different emotions? To answer this third question we contrastively analysed
the range³ of the different parameter values, taken from the second normalization, in the
realization of the three articulatory targets (see Appendix, Tab. 3). These ranges change
not only with respect to the articulatory parameters but above all with respect to the
target typology.

It can be noted that: 

• LO has, in the vowel /’a/, different values than in /b/ and /v/. In fact both
consonants have a lower value range of LO than /’a/, due to the realization of
different degrees of constriction.

• LL parallels LO because the LL movements mostly determine the LO
values.

• The UL range has similar values in /’a/ and /v/ where, for the different emotions, there
is the highest displacement reduction. In fact in /v/ UL does not contribute to the
labiodental constriction; furthermore, it undergoes the coarticulatory effect of the
contextual vowel. In /b/ UL acts as an active articulator in the bilabial
closure, lowering with respect to N.

• LR shows a similar value range in the targets /’a/ and /b/ but not in /v/, because the
labiodental articulation characterizing this last consonant lowers the variability of
the labial setting, determining a narrow value range.

• In ULP and LLP the values of these parameters with respect to the different
emotions in /’a/ and /v/ are opposed to those in /b/. In emotive speech
we do not notice a parallelism between the LR, ULP and LLP parameters, which was
verified instead in the description of the labial characteristics of Italian vowels [1,3].

• For LCX the value range shows a substantial similarity in all three phonetic and
articulatory targets, while RCX has a value range in /v/ that is narrower
than in /’a/ and /b/. Probably these values are due to D asymmetry.

• For LCY and RCY we find a similarity between /’a/ and /v/, with higher
displacements than in /b/, which has reduced vertical displacement of the corners.
Further, we find a higher value range in the left corner than in the right one, probably
due to J asymmetry.
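The value range used in this section is defined in the footnote as the difference between the highest and lowest parameter value over the six emotions. As a minimal check, the sketch below applies that definition to the /’a/ means of Appendix, Tab. 2 and reproduces the /’a/ column of Tab. 3. The variable and function names are introduced here for illustration only.

```python
# Mean second-normalization values (mm) for /'a/ from Appendix, Tab. 2,
# in the emotion order A, D, J, F, Su, Sa.
MEANS_A = {
    "LO":  [5.9, -3.8, -3.0, 9.2, -0.03, 0.2],
    "UL":  [0.4, 4.2, 2.2, -1.2, -0.5, -0.8],
    "LL":  [-4.9, 6.9, 4.5, -7.7, 4.0, -0.6],
    "LR":  [0.7, -7.9, -7.8, 2.7, 0.3, 1.4],
    "ULP": [1.4, -4.1, -2.9, 2.8, 1.1, 2.3],
    "LLP": [1.8, -2.8, -2.2, 0.5, 0.1, 1.4],
    "LCX": [-0.7, 3.5, 4.2, -1.6, -0.4, -0.8],
    "RCX": [0.07, 4.2, 3.5, -1.1, -0.03, -0.6],
    "LCY": [1.3, -0.8, -8.3, 4.0, 1.1, 1.7],
    "RCY": [0.7, -2.4, -7.2, 3.1, 0.1, 1.3],
}

def value_range(values):
    """Range over the six emotions: highest value minus lowest value."""
    return max(values) - min(values)

# Rounded to one decimal, as the ranges are reported in Tab. 3.
RANGES_A = {param: round(value_range(vals), 1) for param, vals in MEANS_A.items()}

for param, rng in RANGES_A.items():
    print(f"{param}: {rng} mm")
```

For example, LO ranges from -3.8 mm (D) to 9.2 mm (F), giving the 13 mm listed for /’a/ in Tab. 3.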



4 Conclusions 

Although the data presented in this paper are only a short first contribution to the
issue of the visual characteristics of emotive speech, several results are interesting. It has
been shown that in emotive speech the articulatory targets /’a/, /b/ and /v/ vary
significantly with respect to the emotions (first normalization results, § 3.1). Further,
the cluster analysis lets us characterize, with respect to the articulatory data,
similarities and differences between emotions (§ 3.2). Finally, the analysis of the
articulatory parameter ranges (§ 3.3), carried out on the second normalization data,
reveals a variability of the articulatory parameters with respect to the articulatory and
phonetic targets. This is due to the labial configuration of emotions, which cannot
violate the linguistic articulatory characteristics.

³ For each parameter the value range was calculated as the difference between the lowest
parameter value and the highest one of the six emotions set.
Appendix



Table 1. Results for the 2-factor ANOVA (emotion, stimulus, interaction) on articulatory
parameters for vowel /’a/ and consonant /b/-/v/, F-ratio and p-value associated. * p<0.01

          /’a/                            /b/-/v/
Par.      Emot.     Stim.    Int.        Emot.     Stim.       Int.
LO        32.681*   0.179    0.826       8.002*    103.890*    1.012
LR        33.484*   0.940    0.673       63.725*   0.624       4.491*
UL        41.203*   2.641    0.862       19.488*   81.444*     2.860
LL        41.259*   0.097    0.888       20.926*   2.727       0.382
LLP       23.424*   3.354    1.959       19.018*   133.397*    5.027*
LCX       31.940*   0.028    0.642       47.846*   0.350       3.856*
RCX       32.696*   2.630    0.873       64.276*   0.183       4.424*
LCY       37.448*   0.540    1.323       41.114*   28.010*     3.257*
RCY       31.167*   0.089    1.063       47.172*   57.270*     5.228
Table 2. Mean values of the second normalization parameters (expressed in mm) for emotions
and targets.

        /’a/                                          /b/
        A      D      J      F      Su     Sa         A      D      J      F      Su     Sa
UL      0.4    4.2    2.2    -1.2   -0.5   -0.8       -1     1      0.6    -2.2   -1     -0.6
LL      -4.9   6.9    4.5    -7.7   4      -0.6       0.02   4.1    1.3    -3.7   -0.6   -1.2
LO      5.9    -3.8   -3     9.2    -0.03  0.2        -0.8   -3.1   -0.9   3.3    0.2    1.5
ULP     1.4    -4.1   -2.9   2.8    1.1    2.3        1.2    -3.6   -0.5   1.8    0.7    1.8
LLP     1.8    -2.8   -2.2   0.5    0.1    1.4        1.4    -3.5   -0.7   -0.01  -0.6   0.6
LR      0.7    -7.9   -7.8   2.7    0.3    1.4        0.7    -8.5   -4.6   1.6    0.2    0.7
LCX     -0.7   3.5    4.2    -1.6   -0.4   -0.8       -0.9   3.7    2.5    -1.4   -0.4   -0.8
RCX     0.07   4.2    3.5    -1.1   -0.03  -0.6       0.3    4.7    2      -0.2   0.07   -0.01
LCY     1.3    -0.8   -8.3   4      1.1    1.7        0.5    2.3    -5.6   2.3    0.6    1.3
RCY     0.7    -2.4   -7.2   3.1    0.1    1.3        -0.8   -0.8   -5     1.4    -0.4   -0.3

        /v/
        A      D      J      F      Su     Sa
UL      -0.3   4.1    3.1    -1.3   -0.3   -0.8
LL      -0.8   3.3    0.5    -3.7   -1.7   -3.1
LO      1.1    -0.2   2      -3.7   1.8    2.3
ULP     0.7    -4.3   -3.7   2.2    1.4    1.2
LLP     0.2    -2.3   -2.1   0.5    1.1    1.8
LR      -1.3   -0.04  0.01   -1.2   -0.03  0.4
LCX     -0.6   2.2    4.2    -1     -0.6   -0.5
RCX     0.2    3.1    3      -0.01  -0.2   -0.6
LCY     2.2    1.4    -7.7   3.4    2      2.5
RCY     1.4    0.1    -6     2.7    1.5    2.8



Table 3. Value range of the different articulatory parameters in the three articulatory targets
(second normalization data).

        /’a/        /b/         /v/
LO      13 mm       6.4 mm      3.9 mm
UL      5.4 mm      3.2 mm      5.4 mm
LL      14.6 mm     7.8 mm      7 mm
LR      10.6 mm     10.1 mm     1.7 mm
ULP     6.9 mm      5.4 mm      6.5 mm
LLP     4.6 mm      4.9 mm      4.1 mm
LCX     5.8 mm      5.1 mm      5.2 mm
RCX     5.3 mm      4.9 mm      3.7 mm
LCY     12.3 mm     7.9 mm      11.1 mm
RCY     10.3 mm     6.4 mm      8.8 mm
Abstract. We present our current state of development regarding animated
agents applicable to affective dialogue systems. A new set of tools is under
development to support the creation of animated characters compatible with the
MPEG-4 facial animation standard. Furthermore, we have collected a 
multimodal expressive speech database including video, audio and 3D point 
motion registration. One of the objectives of collecting the database is to 
examine how emotional expression influences articulatory patterns, to be able 
to model this in our agents. Analysis of the 3D data shows for example that 
variation in mouth width due to expression greatly exceeds that due to vowel 
quality. 



1 Introduction 

At KTH we have for a long time been developing animated talking heads and 
studying their use in various human-machine [1,2] and human-human settings [3]. 
This paper presents an overview of the current status of a number of activities related 
to the area of animated agents that are applicable to affective speech-based dialogue 
systems. The activities include development of new tools and models for expressive 
facial animation and collection of multimodal corpora of expressive speech. These 
activities are partly conducted within the EU-IST project PF-STAR, a project aiming 
at defining technological baselines within several areas related to affective interaction 
with machines, for example emotional speech synthesis and facial animation. 



2 Face Modelling 

We are developing a new set of tools and techniques to facilitate the development and 
animation of face models adhering to the MPEG-4 Facial Animation standard [4]. 
The standard defines 66 low-level facial animation parameters (FAPs) that describe
the animation of a face model. The MPEG-4 FA standard opens up the possibility of
sharing data and models with other researchers in the area, a fact that is central to the
work within the PF-STAR project, where there are three different sites involved in
facial animation work, and the MPEG-4 standard has been chosen as the common
format for data exchange.

¹ Names in alphabetic order

E. André et al. (Eds.): ADS 2004, LNAI 3068, pp. 240-243, 2004.

Development of an animated face model is typically a tedious and time-consuming 
task. To alleviate the task, we have developed tools that automate the model creation 
process, enabling creation of high-quality standards compliant MPEG-4 FA models 
from arbitrary static 3D-meshes with a minimum of user-intervention and taking 
advantage of the capabilities available in state-of-the-art 3D-modelling packages 
when it comes to actual sculpting and texturing of the models. The tool set consists of 
a 3D-modelling package plug-in, for selecting landmark points on the face and 
building the animated model, as well as a lightweight custom rendering engine. To 
increase the expressiveness of the model, the rendering engine includes the capability
of dynamically rendering wrinkles in the face based on local estimates of the
compression of the skin [5]. To achieve real-time performance, the wrinkles are
rendered using hardware-accelerated bump-mapping, leveraging the power of the new 
generation of graphics processing units. 



3 Data Acquisition 

To gain knowledge about how to drive our agents, in terms of expressive non-verbal 
and verbal behaviour, we have collected multimodal corpora of emotive speech using 
an opto-electronic motion tracking system, MacReflex from Qualisys². By using
reflective markers applied on the speaker’s face it is possible to record the 3D
position for each marker with sub-millimetre accuracy, every 1/60th of a second, by using
four infrared cameras. 35 markers were used to record lip movements as well as other
facial movements such as eyebrows, cheek, chin and eyelids. Five markers attached to
a pair of spectacles were used as a reference to be able to factor out head and body
movements. In addition to 3D marker positions, video and audio were recorded.

Two corpora of expressive speech have been collected. Corpus 1 was a sample 
recording aimed at evaluating the feasibility of different elicitation techniques such as 
reading prompts and interactive dialogue. Corpus 2 consisted of non-sense words and 
short sentences, providing good phonetic coverage. 

Corpus 1 was made up of two sub-corpora, one of prompted speech and one of
naturally elicited dialogues. The prompted material consisted of digit sequences and
semantically neutral utterances. 15 different expressions were chosen; together with the
six universal prototypes for emotions (anger, fear, surprise, sadness, disgust and
happiness [6]), we also had the subject act worried, satisfied, insecure, confident,
questioning, encouraging, doubtful, confirming and neutral. For the dialogue sub-corpus,
an information-seeking scenario was used. This communicative scenario is
similar to the one that might arise between a user and an embodied conversational
agent in a dialogue system. One of the dialogue participants had the role of
“information giver”. The domains were movie information (plots, schedules), and
direction giving. The focus of the recording was on the “information giver”, and only
his movements were recorded. Audio recordings included both subjects.

² http://www.qualisys.se

Corpus 2 consisted of VCV & VCCV nonsense words, CVC nonsense words and 
short sentences. An actor read the words and sentences while acting six different 
emotional states, a sub-set of the emotions used in corpus 1: confident, confirming, 
questioning, insecure, happy, neutral. A total of 1700 items were recorded. 

In order to apply the recorded 3D data to the face models, MPEG-4 FAPs were 
extracted, by establishing linear relationships between the MPEG-4 FAPs and the 
displacements of the markers from the neutral state (after compensating global head 
motion). 
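The mapping from marker displacements to FAP values can be sketched as below. This is a simplified illustration, not the procedure used for the corpus: head motion is assumed here to be a pure translation estimated from the reference markers (rotation is ignored), and the linear relationship is reduced to a single gain per FAP. The scaling follows the MPEG-4 convention that FAP units are 1/1024 of a neutral-face distance such as the mouth width. All function names and numbers are hypothetical.

```python
# Sketch: convert a marker displacement (mm) into an MPEG-4 FAP value.
# Assumptions (hypothetical): head motion is a pure translation, and each
# FAP depends linearly on one axis of one marker.

def compensate_head_motion(marker, reference, reference_neutral):
    """Subtract the mean translation of the reference markers from a marker."""
    n = len(reference)
    shift = [
        sum(cur[i] - neu[i] for cur, neu in zip(reference, reference_neutral)) / n
        for i in range(3)
    ]
    return [marker[i] - shift[i] for i in range(3)]

def displacement_to_fap(marker, neutral, axis, neutral_distance_mm, gain=1.0):
    """Express the displacement along one axis in FAP units.

    One FAP unit is neutral_distance_mm / 1024, where neutral_distance_mm is
    a face distance measured in the neutral state (e.g. mouth width).
    """
    fap_unit_mm = neutral_distance_mm / 1024.0
    return gain * (marker[axis] - neutral[axis]) / fap_unit_mm

# Hypothetical frame: the head has moved 1 mm along x since the neutral frame.
REFERENCE_NEUTRAL = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
REFERENCE_NOW = [[1.0, 0.0, 0.0], [11.0, 0.0, 0.0]]
CORNER_NEUTRAL = [24.0, 0.0, 0.0]
CORNER_NOW = [27.5, 0.0, 0.0]

CORNER_FIXED = compensate_head_motion(CORNER_NOW, REFERENCE_NOW, REFERENCE_NEUTRAL)
FAP_VALUE = displacement_to_fap(CORNER_FIXED, CORNER_NEUTRAL, axis=0,
                                neutral_distance_mm=51.2)
print(CORNER_FIXED, FAP_VALUE)
```

Expressing displacements in face-relative units is what allows data recorded from one speaker to drive a model with different proportions.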



4 Data Evaluation 

A test was conducted to classify the data collected in corpus 2. A group of 13
volunteer Swedish students from KTH (6 female and 7 male) was presented with a
total of 90 stimuli, consisting of digitised video-sequences of the Swedish actor
uttering a random selection of the sentences in corpus 2 with the six expressions. The
test was run in a plenary session; the stimuli were presented using a projected image
on a wide screen, in random order, without the audio. Before the experimental session
the participants were instructed to look at the video files and after each video-file
select one of the seven options on the answering sheet, consisting of the six
expressions and an extra category for “other”. The latter was inserted to avoid forced
choice and a possible over-representation of neutral.

The percentages have been calculated over 78 stimuli, since the responses to the first
and last six stimuli were “dummies”. All the expressions were identified above chance level.
The identification rates were 85% for happy, 50% for confirming, 37% for certain, 
29% for questioning, 46% for uncertain and 70% for neutral. No significant 
differences between the responses given by female and male subjects were found. 
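The chance level referred to above can be made concrete with a short calculation. The sketch assumes that chance corresponds to picking one of the seven answer options uniformly at random, and it only compares the reported rates with that level; it is not a significance test. Variable names are introduced here for illustration.

```python
# Identification rates (%) as reported in the text.
RATES = {
    "happy": 85,
    "confirming": 50,
    "certain": 37,
    "questioning": 29,
    "uncertain": 46,
    "neutral": 70,
}

N_OPTIONS = 7                      # six expressions plus "other"
CHANCE = 100 / N_OPTIONS           # uniform guessing, in percent

N_PRESENTED = 90
N_DUMMIES = 2 * 6                  # first six and last six stimuli
N_SCORED = N_PRESENTED - N_DUMMIES

# Margin of each expression over the chance level, in percentage points.
MARGINS = {name: rate - CHANCE for name, rate in RATES.items()}
ALL_ABOVE_CHANCE = all(margin > 0 for margin in MARGINS.values())

print(f"chance level: {CHANCE:.1f}%, scored stimuli: {N_SCORED}")
print(MARGINS)
```

Even the lowest rate, 29% for questioning, is about twice the uniform-guessing level of roughly 14%.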



5 Expressive Articulation 

Most systems for visual speech synthesis are modelled on non-expressive speech, i.e. 
the material is read with a neutral voice and facial expression. However, 
expressiveness might affect articulation and how we produce speech a great deal, and 
an articulatory parameter might behave differently under the influence of different 
emotions. This can be deduced from a quick analysis of vowels in our database, of
which an example is presented in figure 1. The probability density function of the
position of the left mouth corner measured in the middle of all the vowels in the
material is displayed as an ellipse.

Fig. 1. Mean horizontal and vertical position of the left mouth corner for five acted expressive
states: happy, encouraging, angry, sad and neutral from left to right.
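One common way to draw such an ellipse is to fit a two-dimensional Gaussian to the measured positions and plot its one-standard-deviation contour. The sketch below shows that computation; whether the figure was produced this way is an assumption, and the sample points and the function name `covariance_ellipse` are hypothetical.

```python
from math import atan2, degrees, sqrt

def covariance_ellipse(points):
    """Return (centre, semi-major, semi-minor, angle in degrees) of the
    one-standard-deviation ellipse of a set of 2D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix entries.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of the symmetric 2x2 covariance matrix in closed form.
    half_trace = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    semi_major = sqrt(half_trace + spread)
    semi_minor = sqrt(max(half_trace - spread, 0.0))
    # Orientation of the major axis relative to the horizontal axis.
    angle = degrees(0.5 * atan2(2 * sxy, sxx - syy))
    return (mean_x, mean_y), semi_major, semi_minor, angle

# Hypothetical mouth-corner positions (mm) at vowel midpoints for one expression.
SAMPLE = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
CENTRE, SEMI_MAJOR, SEMI_MINOR, ANGLE = covariance_ellipse(SAMPLE)
print(CENTRE, SEMI_MAJOR, SEMI_MINOR, ANGLE)
```

Comparing the distance between ellipse centres across expressions with the ellipse sizes within an expression gives a quick picture of whether expression or vowel identity dominates the variation.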

It can be seen that the expressive state in some instances has a stronger influence 
on the articulation than do the different vowels. It is also interesting to note that the 
neutral pronunciation displays a pattern different from all the (acted) expressive 
speech versions, with very little variation between vowels and a presumably small 
mouth opening. In this study we did not look into the dynamic influence on the 
segmental articulation in the expressive speech. How much could be described by 
relatively stable settings and what is best described by dynamic expressions is the 
topic of some of our current research. 

Abstract. We discuss the affective aspects of tutoring dialogues in an 
ITS, called INES, that helps students to practice nursing tasks using 
a haptic device and a virtual environment. Special attention is paid to 
affective control in the tutoring process by means of selecting the appro- 
priate feedback, taking into account elements of the student’s character, 
the harmfulness of errors made, and the emotional effects of errors and 
feedback. The tutoring system will adjust both the affective language 
used and the overall teaching tactics. 



1 Introduction 

Affect control^1 is an important factor in social interaction in general. Sometimes 
learning is more effective with the help of a personal tutor. But this is very 
expensive and therefore it could be useful to develop intelligent personal tutoring 
systems that help students learn either a conceptual theory, to solve theoretical 
or practical problems or to master a practical task. In this paper we discuss 
affect control in a tutoring system called INES that helps nursing students in 
practicing tasks such as giving a patient a subcutaneous injection. In nursing 
schools students practice this task with a fake plastic arm. A teacher is present 
who gives advice and feedback. This is very time consuming and expensive. 

With the INES system students can practice the tasks on the computer using 
a haptic device, a Phantom, to give an injection in the arm of a virtual patient. 
This can be seen in the picture in Fig. 1. The tutor is visualized by a talking 
head that can display some emotion expressions.^2 

Our aim is to make an ITS that builds on the psychology of affect and emotion 
to make the learning process more efficient. What should the tutor say or do in 
a particular situation to optimize the learning curve? 

The part of the ITS that we will focus on in this paper^3 is the instruction 
module that decides on the sequence of instructions that will be given (see 
Fig. 2). 

^1 “People credit themselves and others with specific identities during social encounters. 
They then engage in physical and mental work so that events create impressions 
that maintain sentiments attached to their identities, as well as to other categories 
of action (i.e., behaviors, settings, and personal conditions - emotions included).” 
From [1]. 

^2 More about the agent-based architecture of this rather complex multi-modal 
interaction system can be found in [2]. 

^3 More information about INES can be found in [3] and [4]. 

Fig. 1. A picture of a student working with the system. 



Fig. 2. Overview of the different agents and components in the tutor component. 
(Component labels recoverable from the figure: Abstract User Interface, Expressive Face.) 



The module termed Learning Components decides what kind of learning 
component (see [5]) or teaching strategy is used. For instance, if the Socratic 
Method is active, the tutor will tend to ask more questions. The choice of 
Learning Component can be changed dynamically during a tutoring session. The 
choice is mainly based on the activity level of the student and the particular stage 
in an exercise. In the current prototype implementation we have considered ac- 
tive student learning, deep explanatory reasoning and affect and motivation as 
learning components and the Socratic method as a pedagogical strategy. 

So-called error-agents provide the instruction module with information about 
how the student is performing the task. The kind of error students make, the 
harmfulness of the error, and the number of times a student has made it are the 
main input to determine the kind of response. 

The INES system also contains emotion appraisers. These set the intensity 
of a few emotion types based on the events that occur. There are two sets of 
values. One set models the tutor emotions and another set the student emotions. 
These values are taken into account in determining the nature and the form of 
the dialogue act that will be performed. 

In the following sections we will go into some more detail on how the system 
works and the considerations that have led us to set up the system in this way. 
We first present some of the background on tutoring which we used in our work. 



2 Affect in Tutoring 

Tutoring situations can be characterized as a social encounter, the goal of which 
is for a student to learn some task or acquire knowledge with the tutor acting in 
all kinds of ways to assist the student with this goal. As with almost all actions 
we carry out, our emotional state plays an important part in the selection of our 
actions and the evaluation of the result. The actions we carry out and their results 
have an impact on how we feel, in turn. Bales ([6]) systematically observed groups 
in laboratories and found that a substantial proportion of group interaction 
is devoted to the socio-emotional issues of expressing affect and dealing with 
tension. Likewise, the actions of a tutor are also not just restricted to pure 
instructions but they should also create the right emotional conditions for a 
student to act. The fact that the tutoring situation is a social encounter means 
that influencing the emotional state proceeds through social acts with emotion 
changing potential. For instance, the tutor has the status to judge (criticize or 
praise) the student for his actions. There are several other interpersonal actions 
that give rise to affect appraisals. Consider, for instance, the difference between 
the psychological effect of formulating a task as an order or as a suggestion. The 
tutor has to steer and motivate the student, know when the student welcomes 
a hint, etcetera. In general, one of the outcomes that people want from 
social interaction is to enhance the self of each actor. The ideal outcome is that 
the student is proud of his achievements and feels highly esteemed by the tutor. 
The emotional state related to this form of social interaction typically involves 
elements and variables such as: social rewards, dependence, status, power, and 
face. 

The emotional state of the student contributes a lot to whether a student is 
motivated or challenged, which are key conditions for certain actions. Curiosity 
and puzzlement may lead the student to investigate problems. But frustration may 
also lead to action, even though it is a more negative affect (see [7]). The tutor can choose 
to consider taking certain actions to bring about a change in the emotional 
state. Lepper ([8]) identified four main goals in motivating learners: challenge 
them, give them confidence, raise their curiosity and make them feel in control. 
These goals can be achieved by means of various tactics. The student can be 
challenged by selecting appropriately difficult tasks, or by having the difficulty 
emphasized or by having some kind of competition set up. Confidence can be 
boosted by maximizing success directly (praising) or indirectly (‘it was quite 
a difficult task, you managed to do’). Curiosity is typically raised in Socratic 
methods when the student is asked to ponder many questions. The tutor can 
decide to leave the initiative to the student or offer options that suggest the 
student can make choices and thereby influence the student’s feeling of being in 
control. 

Lepper’s goals have been taken into account in our system in the choice 
of learning component or teaching strategy (Socratic methods, active student 
learning, using deep explanatory reasoning, etcetera). 

Affective parameters in tutoring dialogues not only concern the choice of 
conversational act and the overall tutoring strategy, but also the kind of language 
that is used. In tutoring dialogues we see a number of specific dialogue acts, 
characteristic of tutoring. They make tutoring conversations different from other 
types of conversations, say information dialogues and task dialogues. Some of the 
dialogue acts are special for affect control; either for controlling the atmosphere 
of the conversation or for controlling the motivational state of the student. This is 
an immediate consequence of the specific roles, goals and tasks the participants 
have in a tutorial conversation. The student’s goal is to learn something, to 
solve a theoretical problem or to master a practical task. The task of the tutor 
is to encourage the student’s active learning behavior by guiding him, giving 
hints, and asking questions to stimulate the student to reflect on what he is doing, 
while carefully avoiding giving away the solution to the problem. Types of responses 
that are completely unsatisfactory in other types of conversations are common in 
tutorial dialogue. Tutorial dialogue shows extensive use of acknowledgement acts 
and accept acts by the tutor - showing his understanding and encouragement - 
and a lack of explicit signals from the student’s side that he will cooperate. When 
the learner has to master a complex practical task, the tutor has to follow 
the student’s actions to see whether he is doing the subtasks in the correct order 
and whether he is doing them well. The tutor has to evaluate the student’s 
actions and give appropriate feedback, positive or negative, depending on the 
student and the complexity of the task. When the student makes an error, the tutor 
can choose to interrupt the student or to wait until the student finds it 
out himself. The tutor has to see whether the student makes enough progress in 
learning the task and, if not, figure out the causes of the mistakes he 
makes. In all of these tasks the tutor should have a good feeling for the affective 
and motivational state of the student. The tutoring systems need rather involved 
reasoning and planning mechanisms for realizing all these functions. 

In the following section we will look in more detail at how the INES system 
deals with providing the right kind of instructions. 



3 Tutoring Dialogues and Tutoring Strategies in INES 

Affect in INES is considered from various angles and on different levels. 

— The emotional state of the student will be modelled by an event appraisal 
system. 

— The emotional state of the tutor is modelled as well, including values for 
emotions and parameters such as satisfaction, disappointment, and surprise. 

— The dialogue acts come in different forms, with variation in affective values. 

— Various affective parameters are used in determining which tutoring strategy 
to use and which instructional act to perform (sympathizing or 
non-sympathizing feedback, motivation, explanation, steering, etcetera). 

The main elements that go into calculating the emotional state of the tutor 
are: the level of activity of the student, the difficulty of the task, the history of 
the student (how well he/she did in previous tasks), the number of errors made, 
and the severity of the error. For the tutor model we have limited our implementation 
to four values, representing the emotions joy, distress, happy-for and sorry-for. 
With respect to the student model, there are four emotional categories that affect 
the selection of the kind and form of instruction. These are anxiety - confidence 
and dispirited - enthusiasm. In the current prototype these remain fixed during 
a session and thus act more like personality profiles. 
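
To make the tutor model more concrete, here is a minimal sketch assuming intensities in the range 0 to 1 and simple additive appraisal rules. Only the four emotion names come from the text; the class name `TutorEmotions`, the function `appraise`, the event names and all weights are illustrative assumptions, not the INES implementation.

```python
from dataclasses import dataclass, field

# The four tutor emotions named in the text.
TUTOR_EMOTIONS = ("joy", "distress", "happy-for", "sorry-for")


@dataclass
class TutorEmotions:
    """Holds one intensity per tutor emotion (hypothetical structure)."""
    intensity: dict = field(
        default_factory=lambda: dict.fromkeys(TUTOR_EMOTIONS, 0.0))

    def adjust(self, emotion: str, delta: float) -> None:
        # Clamp to [0, 1] so repeated events cannot run away.
        value = self.intensity[emotion] + delta
        self.intensity[emotion] = max(0.0, min(1.0, value))


def appraise(tutor: TutorEmotions, event: str, severity: int = 1) -> None:
    """Assumed appraisal rule: errors raise distress and sorry-for in
    proportion to their severity; successes raise joy and happy-for."""
    if event == "error":
        tutor.adjust("distress", 0.1 * severity)
        tutor.adjust("sorry-for", 0.1 * severity)
    elif event == "success":
        tutor.adjust("joy", 0.2)
        tutor.adjust("happy-for", 0.2)


# Three errors of severity 2 in a row.
tutor = TutorEmotions()
for _ in range(3):
    appraise(tutor, "error", severity=2)
```

A fuller version would also weigh in the other factors listed above, such as the student's activity level, the task difficulty and the student's history.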

It seems less important that tutors show their own real emotions. But we 
think that tutors should show empathy with the students and understand how 
their emotional state is affected by what the tutors do. In order to make a good 
guess of how the student’s affective system reacts in particular situations the 
tutor also has to have some idea of the character of the student: introverted, 
self-critical, extraverted, anxious about doing something wrong, or self-confident. For a 
believable tutor agent it is necessary that he monitors the emotional and moti- 
vational state of the student. Recognition of the affective and emotional state of 
the student is a difficult task for a tutoring system. It assumes automatic affect 
recognition in text and prosodic properties of speech, in facial expressions and in 
other aspects of non-verbal behavior. Moreover, we need ways to automatically 
detect features to induce the state of concentration, motivation and engagement 
of the student. We have started to investigate some of these techniques. Instead 
of trying to determine the emotions of the real student by means of affective 
computing techniques, the ITS could also build up a model of the student’s 
emotions by using appraisal rules. This is the route that we have pursued for the 
current version. 

In working with INES, students start by choosing an exercise to perform. The 
system uses descriptions of exercises that contain information about the different 
subtasks involved (partially ordered), the kinds of errors, pointers to error-agents 
that can detect the errors, error messages, and steering acts, explanations, hints, 
questions and answers. Many of these are used to specify the precise output 
of INES. The output consists of text messages, spoken output, expressions on 
a 3D face, and haptic feedback through the Phantom. The main conversational 
and tutoring functions that have to be performed are: make conversation (wel- 
come, introduction, closing), give instruction, set a task, answer questions, ask 
a question, provide support, give feedback, explain, evaluate, motivate, expect, 
steering. 

The task of the tutor is mainly to guide the student by monitoring his be- 
haviors and to give feedback. When the student performs these tasks, special 
dedicated error agents monitor the actions of the students and the order in 
which they are performed. When these agents observe an error (wrong angle 
pitch during injection, movements too fast, etc.) they send an error report to 
the instruction agent that evaluates these reports and decides upon a responding 
action. 
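
The error-agent mechanism can be sketched as follows. This is a minimal illustration, not the INES code: the names `ErrorReport` and `AngleErrorAgent`, the harmfulness scale and the 45-degree upper bound used as a threshold are assumptions (the bound is taken from the "30 to 45 degrees" hints quoted in Fig. 5).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorReport:
    kind: str          # e.g. "angle-too-wide"
    harmfulness: int   # illustrative scale: 1 (minor) to 3 (harmful)
    count: int         # how often the student has made this error so far


class AngleErrorAgent:
    """Hypothetical error agent that watches the injection angle."""

    def __init__(self, max_angle: float = 45.0, harmfulness: int = 2):
        self.max_angle = max_angle
        self.harmfulness = harmfulness
        self.count = 0

    def observe(self, angle: float) -> Optional[ErrorReport]:
        """Return a report for the instruction agent, or None if no error."""
        if angle <= self.max_angle:
            return None
        self.count += 1
        return ErrorReport("angle-too-wide", self.harmfulness, self.count)


# One acceptable attempt followed by two attempts with too wide an angle.
agent = AngleErrorAgent()
reports = [agent.observe(angle) for angle in (40.0, 60.0, 70.0)]
```

The report carries the three inputs the text names as decisive for the response: the kind of error, its harmfulness, and the number of times it has been made.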

The type of action and the manner in which it is executed is determined by a 
number of factors. The tutor decides on a specific tutoring strategy or learning 
component (which might be changed in the course of the lesson). 
The selection of the strategy depends on the activity level of the student. 
Three levels are introduced on which the tutor can operate. The idea is that 
there is congruence between the student’s self steering and the tutor’s external 
steering. Thus, on level 1, when the student isn’t active at all, fixed external 
steering is used. The tutor dynamically switches between the different levels, 
taking into account the performance of the student. The idea of using different 
levels of activity is based on an algorithm in [9], which determines what kind of 
hint (with what kind of activity) should be given, taking into account the student’s 
current and previous answers, the number of wrong answers, the previous hints 
produced and number of hints produced. 
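
The level switching could look roughly like the sketch below. Only level 1 ("fixed external steering") is described in the text; the labels for levels 2 and 3 and the switching rule in `next_level` are assumptions made for illustration and do not reproduce the algorithm of [9].

```python
STEERING = {
    1: "fixed external steering",   # from the text
    2: "shared steering",           # assumed label
    3: "student self-steering",     # assumed label
}


def next_level(level: int, recent_results: list) -> int:
    """Hypothetical switching rule based on the last few attempts.

    recent_results holds True for a correct attempt and False for an error.
    """
    if recent_results and all(recent_results):
        return min(3, level + 1)   # doing well: leave more to the student
    if recent_results and not any(recent_results):
        return max(1, level - 1)   # struggling: steer more from outside
    return level                   # mixed or no evidence: stay put


# A student on level 2 who makes two errors in a row is steered more tightly.
level = next_level(2, [False, False])
```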

The choice of strategy influences the kinds of actions that are performed 
and the sequence in which they are performed. For instance, when the Socratic 
method is active, the tutor will choose to ask a lot of questions. We are interested 
in having our tutoring system teach the students certain practical skills with an 
insight into the reasons for doing tasks in a specific way. An important part 
of tutoring sessions is therefore to let the students practice some task. In this 
case the tutor simply observes what the student does and provides feedback. 
Ideally, the student should be active and self-motivated to practice the tasks. 
Because the typical exercises involve practicing a task, the error-agents form an 
important component of the system. They observe the student’s actions, diagnose 
them and report to the instruction-agent: the multi-modal dialogue manager or 
brain behind the ITS that determines the sequence of instructions. One example is 
the sequence (feedback, motivation, support). This is called the teaching task 
support-tt and is typically instantiated in the case of a haptic error. For each of the 
instructions, called dialogue tasks, a dialogue function will have to be chosen. 
These are more specific instances of the dialogue tasks. For support there are 
functions such as ‘hint correct answer’, ‘hint error emphasize’ and ‘hint error 
diagnosis’. For each dialogue function there are several specific dialogue acts. 
These can be verbal or nonverbal messages (head nods, for instance), or feedback 
given through the haptic device. 
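
The layering of teaching task, dialogue tasks, dialogue functions and dialogue acts can be pictured as nested lookups. In this sketch the support-tt sequence and the support functions follow the text, and the example utterances are taken from Fig. 5; the assignment of functions to the feedback and motivation tasks, the table names and the `realize` function are assumptions.

```python
# Teaching task -> ordered dialogue tasks (sequence named in the text).
TEACHING_TASKS = {"support-tt": ["feedback", "motivation", "support"]}

# Dialogue task -> candidate dialogue functions. The support entries follow
# the text; the feedback and motivation entries are assumed from Fig. 5.
DIALOGUE_FUNCTIONS = {
    "feedback": ["FEEDB-PART-REJECT", "FEEDB-REJECT"],
    "motivation": ["M-WRONG-ANSWER"],
    "support": ["HINT-ERROR-DIAGN", "HINT-ERROR-EMPH", "HINT-CORRECT-ANSWER"],
}

# Dialogue function -> concrete verbal dialogue acts (examples from Fig. 5).
DIALOGUE_ACTS = {
    "FEEDB-PART-REJECT": ["Not so much that"],
    "FEEDB-REJECT": ["That isn't the correct answer"],
    "M-WRONG-ANSWER": ["That's all right"],
    "HINT-ERROR-DIAGN": ["Take care of a correct angle"],
    "HINT-ERROR-EMPH": ["Take care of a smaller angle of the needle"],
    "HINT-CORRECT-ANSWER": ["Take care that the angle with which you put "
                            "the needle into the arm is 30 to 45 degrees"],
}


def realize(teaching_task, choose=lambda options: options[0]):
    """Expand a teaching task into one (task, function, act) triple per
    dialogue task. `choose` stands in for the affect-sensitive selection."""
    acts = []
    for task in TEACHING_TASKS[teaching_task]:
        function = choose(DIALOGUE_FUNCTIONS[task])
        acts.append((task, function, choose(DIALOGUE_ACTS[function])))
    return acts


acts = realize("support-tt")
```

In the real system the `choose` step is where preconditions and affective parameters would come in; here it simply takes the first option.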

Figure 5, at the end of this paper, shows a list of dialogue tasks and dialogue 
functions that we identified for a teacher tutoring procedural tasks. Each of the 
functions can only be applied when a number of preconditions are satisfied. In 
the prototype a selection of the dialogue tasks has been implemented. These are 
marked with an asterisk. We have left out tasks such as conversation (used for 
welcoming, introductions and closings), questions and answers. 



4 Some Initial Tests 

Several types of tests of the system have been performed but no user evaluation 
has been carried out so far. We tested the effect of various parameters on the 
choice of dialog acts by letting the system respond to the same input with dif- 
ferent settings for these parameters. The behavior of the student was simulated 
according to specific scenarios. The tests were conducted to see how the tutor’s 
conversational responses and his emotion expression vary with the settings of 
a number of parameters in a scenario in which the student repeatedly makes the 
same error, namely inserting the needle at an angle that is too wide. The 
tests were performed in varying situations in which the following features were 
controlled. 

1. Tutor shows his own emotion or not. Four emotion categories were implemen- 
ted and tested for their effect on the conversational behavior: joy, distress, 
happy-for and sorry-for. 

2. The tutor takes the student’s emotions into account: the student is 
either confident or insecure.^4 

3. The tutor follows a Socratic learning strategy and stimulates active student 
learning. The tutor then favors asking questions above giving lengthy expla- 
nations. 

4. The harmfulness of the error the student makes. 

Figures 3 and 4 show a dialogue between the emotional tutor with a confident 
versus an insecure student, respectively. 



Student: Uses an angle that is too wide 
Tutor: Shows a negative facial expression (sadness) 

Student: Uses an angle that is too wide 

Tutor: Not quite. It’s okay. Take care of the angle with which you put the needle in. 
Student: Uses an angle that is too wide 

Tutor: Stop. That’s wrong! You shouldn’t inject with an angle that is too wide. Try 
it again. 

Student: Uses an angle that is too wide 
Tutor: Try again. I will steer you along this time. 

Student: Uses an angle that is too wide 
Tutor: Decrease the angle. 

Fig. 3. The dialogues in the test: between a confident student and an emotional tutor 

Student: Uses an angle that is too wide 
Tutor: Shows a negative facial expression (sadness) 

Student: Uses an angle that is too wide 

Tutor: Not quite. You use a wrong angle. Okay, start again. 

Student: Uses an angle that is too wide 

Tutor: Not quite. It doesn’t matter. Take care the angle isn’t too wide. 

Student: Uses an angle that is too wide 

Tutor: Wait a minute, you’re in the wrong direction. It’s okay. Try to keep an angle 
of thirty to forty-five degrees. 

Student: Uses an angle that is too wide 

Tutor: Not quite. I will give a demonstration of the task. 



Fig. 4. The dialogues in the test: between an insecure student and an emotional tutor 



^4 These states correspond with two quadrant spaces - anxiety vs. confidence and 
dispirited vs. enthusiasm - of Reilly and Kort ([7]). 

Dialogue Task  Dialogue function     Explanation/Example
-------------  --------------------  -------------------------------------------
SUPPORT        HINT-ERROR-DIAGN*     Hint including the error made
                                     “Take care of a correct angle”
               HINT-ERROR-EMPH*      Hint including the error made
                                     “Take care of a smaller angle of the needle”
               HINT-CORRECT-ANSWER*  Hint including the correct answer
                                     “Take care that the angle with which you put
                                     the needle into the arm is 30 to 45 degrees”
FDB-TURN-TU    FEEDB-ACCEPT*         Accepting feedback
                                     “That is the correct answer”
               FEEDB-PART-ACCEPT*    Partly accepting feedback
                                     “Something like that”
               FEEDB-PART-REJECT*    Partly rejecting feedback
                                     “Not so much that”
               FEEDB-REJECT*         Rejecting feedback
                                     “That isn’t the correct answer”
FDB-TURN-ST    FEEDB-ACKN-BACK*      Acknowledging feedback
                                     “Uh-huh”
               FEEDB-BACK-QUESTION   Question back
                                     “Is that right?”
               FEEDB-NON-UNDERST     Non understanding feedback
                                     “Excuse me?”
               FEEDB-REPEAT          Repeating feedback
                                     “So your answer is 45 degrees?”
EXPLAIN        ERROR-DIAGN*          Explanation including the error made
                                     “The error you made has to do with
                                     the angle in which you put in the needle”
               ERROR-EMPH*           Explanation including the error made
                                     “You shouldn’t bring in the needle with
                                     such a big angle”
               CORRECT-ANSWER*       Explanation including the correct answer
                                     “You should bring the needle in
                                     with an angle of 30 to 45 degrees”
EVALUATE       STUD-EVAL-POS*        Positive evaluation of the performance
                                     “You did a good job”
               STUD-EVAL-NEG*        Negative evaluation of the performance
                                     “You didn’t do it very well”
               ERROR-ENUM*           Error enumeration
                                     “You should take care of the angle
                                     in which the needle is brought in”
               ANSWER-ENUM           Answer enumeration
                                     “Just remember that the needle should be
                                     brought in with an angle of 45 degrees”
               EXPECT-LATER          Expectations for later
                                     “I expect you to do it right next time”
MOTIVATE       M-WRONG-ANSWER*       Motivate when a wrong answer is given
                                     “I’m real sorry” (sympathy)
                                     “That’s all right” (downplayer)
               M-CORRECT-ANSWER*     Motivate when answer correct (enthusiasm)
                                     “Good job, I knew you could do it.”
EXPECT         EXPECT-NOW            Expectations for now
                                     “It’s a difficult task”
EXCLAMATION    EXCLAMATION-NEG*      Negative exclamation
                                     “How did you get to such a wrong answer!”
STEERING       STEER*                Steering at real-time.
                                     “Decrease the angle.”

Fig. 5. Dialogue tasks and functions 

These two dialogue fragments clearly show that setting the parameters differently 
causes significant differences in the teaching tasks and dialogue functions. 
In the second case the tutor responds in a more sympathizing way, using more 
motivational verbal instructions, no exclamations and no total rejection. We have 
let different people judge the various scenarios (showing them not just how the 
dialog unfolded but also the facial expressions) and in general the output changes 
in line with what people would expect to change in the various situations. Of 
course, the dialogue is contrived, because the student keeps making the same 
error. Also, the prototype uses several responses that may seem a bit awkward 
or exaggerated. This can be fixed by adjusting the exercise files and is therefore 
a matter of scripting. We will need to fine-tune one or more exercises before we 
can carry out further evaluations of the system. 
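
The test set-up, the same simulated error sequence replayed under different parameter settings, can be sketched as below. The selection rule in `respond` is a hypothetical stand-in written to mirror the contrast described above (no exclamation and no total rejection for the insecure student); it is not the INES rule base, and the dialogue function names are borrowed from Fig. 5.

```python
def respond(error_count: int, profile: str) -> list:
    """Hypothetical choice of dialogue functions for a repeated error."""
    if profile == "insecure":
        # Sympathizing: partial rejection, a downplayer and a hint.
        return ["FEEDB-PART-REJECT", "M-WRONG-ANSWER", "HINT-ERROR-EMPH"]
    if error_count >= 4:
        return ["STEER"]
    if error_count >= 3:
        return ["EXCLAMATION-NEG", "FEEDB-REJECT", "ERROR-EMPH"]
    return ["FEEDB-PART-REJECT", "HINT-ERROR-DIAGN"]


def run_scenario(profile: str, repetitions: int = 5) -> list:
    """Simulated student who makes the same error `repetitions` times."""
    return [respond(n, profile) for n in range(1, repetitions + 1)]


confident = run_scenario("confident")
insecure = run_scenario("insecure")

# Dialogue functions that occur only with the confident student.
only_confident = ({f for turn in confident for f in turn}
                  - {f for turn in insecure for f in turn})
```

Comparing the two runs turn by turn is the kind of inspection that the judges were asked to carry out on the real system output.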



5 Conclusion 

We have refined an earlier version of our tutoring system with several modules to 
account for affective issues during tutoring. The emotional states of students and 
the tutor are modelled and taken into account in various ways in the tutoring 
process. The elaborate instruction model is built around pedagogical strategies 
and learning components (Socratic methods, active student learning, etcetera). 
The tactics used in the tutoring situation depend to a large extent on the mental 
state a student is in or should be in. Besides considering the overall strategy, 
the emotional states also have an effect on the selection of dialogue acts and the 
precise phrasing. 

The system does not recognize the student’s emotional state in any way. We 
make some basic assumptions about what we expect the student to feel. These 
are rules of thumb, for instance, frustration gets bigger when things keep on going 
wrong. We are currently also investigating ways to determine some aspects of 
the emotional state of the real student. 

Abstract. Analysis of student-tutor coaching dialogs suggests that good human 
tutors attend to and attempt to influence the motivational state of learners. 
Moreover, they are sensitive to the social face of the learner, and seek to miti- 
gate the potential face threat of their comments. This paper describes a dialog 
generator for pedagogical agents that takes motivation and face threat factors 
into account. This enables the agent to interact with learners in a socially ap- 
propriate fashion, and foster intrinsic motivation on the part of the learner, 
which in turn may lead to more positive learner affective states. 



1 Introduction 

Animated pedagogical agents, or guidebots, exploit human-like communication mo- 
dalities, such as speech and nonverbal gesture, in order to promote more effective 
learning [10]. Learning materials incorporating animated agents can engender a 
higher degree of learner interest than similar learning materials that lack such agents 
[16], and reduce the perceived difficulty of the learning material [1]. They can also 
produce a positive affective response on the part of the learner, sometimes referred to 
as the persona effect [13]. This is attributed to the natural tendency for people to 
relate to computers as social actors [23], a tendency that animated agents exploit. 

Educational researchers increasingly recognize the importance of learner affective 
states in promoting effective learning. Of particular importance are factors such as 
self-confidence and interest that contribute to learner intrinsic motivation [25]. Expert 
human tutors are also able to recognize when learners have negative affective states 
due to poor motivation or low sense of self-efficacy, and can try to influence learner 
motivation through encouragement and other motivational tactics [12]. Since 
animated agents can display emotion via speech and gesture, it is reasonable to suppose 
that they could be well suited to promoting positive affective states. Yet clearly doing 
this properly involves not just generating emotional displays per se, but generating 
the appropriate type of display at the right time in order to influence the learners in the 
appropriate way. 

This paper describes a model for dialog generation in a guidebot, designed to emulate 
tactics that human tutors use to influence learner motivation. It takes into account 
characteristics of the social relationship between the guidebot and the learner, as well 
as learner motivational factors and individual learner characteristics. The work is the 
combined effort of several students of the University of Twente, each of whom per- 
formed internships of a few months in duration at the Center for Advanced Research 
in Technology for Education (CARTE) at USC / Information Sciences Institute. It is 
part of a larger project building a socially intelligent pedagogical agent able to 
monitor learner performance and provide socially sensitive coaching and feedback at 
appropriate times [9]. 



2 Background Studies on Tutor-Learner Interaction 

To investigate the role that social intelligence plays in learner-tutor interaction, we 
videotaped interactions between learners and a human tutor while the students were 
working with a particular on-line learning environment, the Virtual Factory Teaching 
System (VFTS) [7]. Students read through an on-line tutorial in a Web browser, and 
carried out actions on the VFTS simulation as indicated by the tutorial. Learners were 
supposed to analyse the history of previous factory orders in order to forecast future 
demand, develop a production plan, and then schedule the processing of jobs within the 
factory in order to meet the demand. The tutor sat next to the students as they worked, 
and could interact with them as the student or the tutor felt appropriate. Completing the 
entire scenario required approximately two hours of work, divided into two sessions of 
around one hour. Three video cameras were used: one focused on the learner’s face, 
one focused on the computer screen, and one provided a view of the learner and tutor 
together. This made it possible to track the learner’s actions and focus of attention, as 
well as verbal and nonverbal interactions between the learner and the tutor. The tutor 
was a USC professor who has won awards for teaching excellence, and who uses the 
VFTS in his courses. 

To analyse the interactions, and use them in designing learner-agent dialog, we 
transcribed them and annotated them using the DISCOUNT scheme [19]. DISCOUNT 
represents the structure of educational dialogs as a series of episodes, each pertaining to a 
particular topic. Episodes are divided into exchanges between the parties in the dialog, 
which are composed of a series of turns (e.g., initiate, respond, reinitiate). Each turn 
consists of one or more dialog moves, classified according to speech act (hint, support, 
contradict, etc.) and marked with predicate labels that indicate the function of the move 
in the dialog. 
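
The nesting that DISCOUNT imposes on a dialog (episodes, exchanges, turns, moves) maps naturally onto a small set of record types. The sketch below is only an illustration of that structure: the class and field names, the topic string and the predicate label "question" are assumptions, and the example utterance is one of the tutor hints quoted in Section 2.1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Move:
    speech_act: str                      # e.g. "hint", "support", "contradict"
    predicates: List[str] = field(default_factory=list)
    text: str = ""


@dataclass
class Turn:
    kind: str                            # e.g. "initiate", "respond", "reinitiate"
    speaker: str
    moves: List[Move] = field(default_factory=list)


@dataclass
class Exchange:
    turns: List[Turn] = field(default_factory=list)


@dataclass
class Episode:
    topic: str
    exchanges: List[Exchange] = field(default_factory=list)


episode = Episode(
    topic="capacity",
    exchanges=[Exchange(turns=[
        Turn("initiate", "tutor",
             [Move("hint", ["question"], "Want to look at your capacity?")]),
    ])],
)

# Collect the text of every hint move in the episode.
hints = [move.text
         for exchange in episode.exchanges
         for turn in exchange.turns
         for move in turn.moves
         if move.speech_act == "hint"]
```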
2.1 Interaction Tactics and Learner Motivation 

A striking feature of these dialogs was that although they involved many episodes 
where the tutor was offering advice as to what to do, in very few cases did the tutor 
give explicit instructions of what to do. Rather, the tutor would phrase his comments 
so as to subtly engage the learner’s interest and motivation, while leaving the learner 
the choice of what to do and how. These include hints phrased as questions, e.g.: 
Tutor: Want to look at your capacity? 

The tutor’s comments often would reinforce the learner’s sense of being an active par- 
ticipant in the problem solving process, e.g., by phrasing suggestions as activities to be 
performed jointly by the tutor and the learner, e.g.: 

Tutor: So why don’t we go back to the tutorial factory. . . 

Following the work of Sansone, Harackiewicz, and Lepper and others [25, 12], we 
analyze these comments as intended to influence learner intrinsic motivation. Learn- 
ers tend to learn better and more deeply if they are motivated by an internal interest 
and desire to master the material, as opposed to extrinsic rewards and punishments 
such as grades. Researchers in motivation have identified the following factors as 
conducive to intrinsic motivation among others: 

— Curiosity in the subject matter, 

— An optimal level of challenge, neither too little nor too much, 

— Confidence, i.e., a sense of self-efficacy, and 

— A sense of control - being free to choose what problems to solve and how, as 
opposed to being told what to do. 

The tutorial comments observed in the dialogs tend to be phrased in such a way as to 
have an indirect effect on these motivational factors, e.g., phrasing a hinted action as a 
question reinforces the learner’s sense of control, since the learner can choose whether 
or not to answer the question affirmatively. These motivational factors in turn are 
closely linked to learner affect, e.g., confidence and optimal challenge reduce the fear 
of problem solving failure and increase the satisfaction of success. 

Although these comments indicated that the tutor was sensitive to the learner’s mo- 
tivational state, and post-session interviews confirmed this, there were hardly any 
instances in the dialogs of comments aimed explicitly and solely at influencing
learner motivation, such as “Good job!” or “You can do it!” To model this type of 
interaction in a guidebot, it was clearly necessary to develop a dialog generation 
model that would allow learner motivation to have a pervasive influence, without 
requiring a separate repertoire of tactics with purely motivational intent. 



3 Generating Interaction Tactics 

Based upon these analyses, Sander Kole and Wauter Bosma developed a natural lan- 
guage generator for producing appropriate interaction tactics. The generator takes as 
input a set of language elements - short noun phrases and short verb phrases in the target 
domain - and predicates describing the desired dialog move. It chooses an utterance 
pattern that matches the dialog move predicates most closely, instantiates it with the 
language elements, and synthesizes an utterance, which is then passed to the guidebot
persona for uttering using text-to-speech synthesis. 

The underlying generation scheme utilizes DISCOUNT as a way of classifying dialog 
moves. It operates on a set of move templates, each of which includes a set of 
DISCOUNT predicates and a template for expressing the move in natural language. The 
templates specify slots for language elements, which are filled from the language ele- 
ments supplied to the generator. The move templates and language elements are speci- 
fied using an XML syntax and all defined in one language definition file. Fig. 1 shows an 
example move from the language definition file. The moves are based upon utterances 
found in the dialog transcripts; the comments at the top of the move template show the 
original utterance and the transcript and time code where it was found. The move tem- 
plate may classify the move in multiple ways, reflecting the fact that the same utterance 
may have multiple communicative roles, and different coders may code the same utter- 
ance differently. 

<move>
  <!-- 7SP1 47:11 T -->
  <!-- So number 2, the number of seasons may not be 2 then. -->
  <predicate role="initiating" move="all" name="action1"/>
  <predicate role="initiating" move="all" name="noun1"/>
  <predicate role="initiating" move="hint" name="suggest"/>
  <predicate role="initiating" move="inform" name="identify"/>
  <predicate role="initiating" move="reason" name="explain"/>
  <template>
    So <nounphrase case="object" type="parameter"
                   name="noun1.nounphrase1"/>
    may not
    <verbphrase type="parameter" name="action1.action1"
                form="infinite"/>
  </template>
</move>



Fig. 1. An example dialog move template 
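
The slot-filling step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the generator itself: the function name instantiate, the dictionary keyed by slot name, and the whitespace handling are hypothetical, and only the element and attribute names are taken from the move template in Fig. 1.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the Fig. 1 move template (comments and most predicates omitted).
MOVE_XML = """
<move>
  <predicate role="initiating" move="hint" name="suggest"/>
  <template>
    So <nounphrase case="object" type="parameter" name="noun1.nounphrase1"/>
    may not
    <verbphrase type="parameter" name="action1.action1" form="infinite"/>
  </template>
</move>
"""


def instantiate(move_xml, elements):
    """Fill each slot of the <template> with the language element of the same name."""
    template = ET.fromstring(move_xml).find("template")
    parts = [template.text or ""]
    for slot in template:  # slots are visited in document order
        parts.append(elements[slot.get("name")])  # hypothetical lookup by slot name
        parts.append(slot.tail or "")  # literal text that follows the slot
    # Collapse the XML line breaks and indentation into single spaces.
    return " ".join("".join(parts).split())


# Language elements chosen to reproduce the utterance pattern of Fig. 1.
language_elements = {
    "noun1.nounphrase1": "the number of seasons",
    "action1.action1": "be 2",
}
print(instantiate(MOVE_XML, language_elements))
```

Keying the language elements by the slot's name attribute is only one plausible reading of the template; the case, type, and form attributes, which would presumably guide inflection, are ignored in this sketch.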

Using this generation framework, it is possible to present the same tutorial com- 
ment different ways. For example, a suggestion to perform an action, such as saving 
the current factory description, can be stated either directly (e.g., “Save the factory 
now”), as a hint, (“Do you want to save the factory now?”), as a suggestion of what 
the tutor would do (“I would save the factory now”), as a suggestion of a joint action 
(“Why don’t we save our factory now?”), etc. 
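
As a small illustration of how one comment can be realized with different tactics, the four phrasings quoted above can be written as patterns over a verb and a noun. The pattern table, the tactic labels, and the function realize are assumptions made for this sketch; the actual generator uses the XML move templates described earlier.

```python
# Hypothetical utterance patterns, one per tactic. The wording follows the
# examples quoted in the text; the tactic labels are illustrative.
PATTERNS = {
    "bald_on_record": "{Verb} the {noun} now.",
    "question": "Do you want to {verb} the {noun} now?",
    "tutor_goal": "I would {verb} the {noun} now.",
    "joint_goal": "Why don't we {verb} our {noun} now?",
}


def realize(tactic, verb, noun):
    """Return the utterance for one tactic, given a verb and a noun."""
    return PATTERNS[tactic].format(verb=verb, Verb=verb.capitalize(), noun=noun)


for tactic in PATTERNS:
    print(tactic, "->", realize(tactic, "save", "factory"))
```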
4 Politeness as a Framework for Selecting Interaction Tactics

Having defined this set of dialog moves and implemented a generator that can produce 
them, the next challenge is to determine which tactic to employ in which circum- 
stances. How does the choice of interaction tactic depend upon the learner, the topic 
being discussed, and the state of the social interaction between the learner and the 
tutor? 

The politeness theory of Brown and Levinson [5] helps provide a rationale for 
these decisions. In this theory, social actors are motivated by a set of wants, including 
social wants: negative face, the want to be autonomous and unimpeded by others, and 
positive face, the want to be desirable to others. Common speech acts between social 
actors, such as requests and offers, can threaten the hearer’s negative face, positive 
face, or both, and therefore are referred to as face threatening acts (FTAs). Speakers 
use various types of politeness strategies to mitigate face threats, depending upon the 
severity, or weightiness, of the potential face threat. 

The following are some examples of politeness strategies in the context of tutoring 
a student in operating the VFTS. Consider a critique of the learner such as “You 
didn’t save your factory. Save it now.” This is an example of what Brown and Levinson
term a bald on record FTA; there is no attempt to use politeness to mitigate the
face threat. There are two types of face threat in this example: the criticism of the 
learner’s action is a threat to positive face, and the instruction of what to do is a threat 
to negative face. 

There are various politeness tactics that can be employed to mitigate the face threat. 
One is simply to avoid the face threatening act altogether if the cost of making the 
threat is greater than the potential benefit. In the above case the tutor could omit the 
criticism of the learner and focus on the suggested action, i.e., to save the factory. 
Alternatively the tutor could perform the face-threatening act off record, i.e., so as to 
avoid assigning responsibility to the hearer. An example of this would be “The fac- 
tory parameters need saving.” The face threat of the instruction can be mitigated 
using negative politeness tactics, i.e., phrasing that gives the hearer the option of not 
following the advice, e.g., “Do you want to save the factory now?” Positive politeness 
strategies can also be employed that emphasize common ground and cooperation 
between the tutor and learner, e.g., “How about if we save our factory now?” Other 
positive politeness strategies include overt expressions of approval, such as “That is 
very good,” however as noted above such strategies were uncommon in the interac- 
tions that we studied. 

In the Brown and Levinson model, evaluation of face threat depends upon several
factors. First, the relative weightiness of different face threats is culturally dependent.
The weightiness of a face-threatening act also depends upon the relative power P 
between the speaker and the listener. Tutors generally have power relative to learners, 
so we would generally expect tutors to make use of weaker politeness strategies when 
speaking to learners than the learners use in reverse. Finally, the weightiness of a face 
threat depends upon the social distance between the two parties. As two people inter- 
act over time their social distance often decreases, reducing the severity of face threat- 
ening acts and increasing the likelihood that bald-on-record strategies will be used. 
Although the Brown and Levinson model is not specifically aimed at modeling tutorial
dialog, it provides a good means of accounting for variability in tutorial dialog.
The interaction tactics observed in the recorded dialogs, when other than bald-on- 
record statements, have the effect of mitigating face threats. Since offers of advice 
and requests to perform actions are face threatening acts, the theory predicts that tutors 
will employ face mitigation strategies for these kinds of interactions, but not for other 
dialog moves such as explanatory comments. This is consistent with the observed 
data. The theory predicts that the incidence of face threat mitigation strategies will 
decrease as tutor and learner interact for longer periods of time. This trend is also 
observable in the data that we have collected; the incidence of bald-on-record tactics 
was greater in the follow-on tutorial sessions than in the initial sessions. 

Although politeness theory and motivation theory come out of distinct literatures, 
their predictions regarding the choice of tutorial interaction tactics are broadly consistent.
This is not surprising, since the wants described by politeness theory have a
clear motivational aspect; negative face corresponds to control, and positive face 
corresponds somewhat to confidence in educational settings. To bring the two frame- 
works together, we extend the Brown & Levinson model in certain respects. First, 
whereas Brown & Levinson’s model assigns a single numeric value to each face 
threat, we extend their model to consider positive face threat and negative face threat 
separately. This enables us to select a redressive strategy that is appropriate to the 
type of face threat. For example, if an FTA threatens negative face but not positive 
face, then the politeness model should choose a redressive strategy that mitigates 
negative face threat; in contrast the basic Brown & Levinson model would consider a 
redressive strategy aimed at positive face to be equally appropriate. Second, we allow 
for the possibility that the tutor might wish to explicitly enhance the learner’s face, 
beyond what is required to mitigate immediate face threats. For example, if the tutor 
judges that the learner needs to feel more in control, he or she will make greater use of 
redressive strategies that augment negative face. 

Altogether, the amount of face threat redress is determined by the following for- 
mulas, which are slight elaborations of the weightiness formulas proposed by Brown 
& Levinson [5]: 

\[ W_{+} = D(T,S) - P(T,S) + R_{+} + A_{+} \]

\[ W_{-} = D(T,S) - P(T,S) + R_{-} + A_{-} \]

Here $W_{+}$ and $W_{-}$ are the amounts of positive and negative face threat redress,
respectively, $T$ represents the tutor, and $S$ represents the student. $D(T,S)$ is the social
distance between the tutor and the student, and $P(T,S)$ is the amount of social power that
the tutor has over the student. $R_{+}$ is the inherent positive face threat of the
communicative act (e.g., advising, critiquing, etc.), $R_{-}$ is the inherent negative face threat of
the act, $A_{+}$ is the amount of augmentation of positive face desired by the tutor, and $A_{-}$
is the desired augmentation of learner negative face.
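
To make the arithmetic concrete, the two redress formulas can be evaluated as in the following sketch. The function name and all numeric values are invented for illustration; the paper does not specify the scale on which distance, power, threat, and augmentation are measured.

```python
def face_threat_redress(distance, power, r_pos, r_neg, a_pos=0.0, a_neg=0.0):
    """Return (W+, W-): the positive and negative face threat redress.

    distance is D(T,S), power is P(T,S), r_pos and r_neg are the inherent
    threats of the act, and a_pos and a_neg are the desired augmentations.
    """
    base = distance - power  # term shared by both formulas
    return base + r_pos + a_pos, base + r_neg + a_neg


# Hypothetical case: an instruction that mainly threatens negative face, given
# by a tutor who also wants the learner to feel more in control (a_neg > 0).
w_pos, w_neg = face_threat_redress(
    distance=0.5, power=0.25, r_pos=0.25, r_neg=0.75, a_neg=0.5
)
print(w_pos, w_neg)  # 0.5 1.5
```

With these made-up inputs the negative redress is three times the positive redress, so a tactic that leaves the choice to the learner would be favored over one that mainly expresses approval.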

Additional factors clearly need to be taken into account besides politeness theory in
order to fully account for the influence of interaction tactics on learner motivation. For
example, politeness theory per se does not explain the relative infrequency of com- 
ments aimed solely at positive face, e.g., expressions of praise. In our analyzed dia- 
logs, positive praise is confined to the ends of VFTS sessions, when the learner has 
completed the assigned tasks. One way to account for this is to note that learners are
motivated not just by positive face, i.e., to be approved of by the tutor, but by a desire 
for self-efficacy, i.e., to approve of their own performance. Since VFTS tasks take a 
long time to complete, it is difficult to tell whether the learner is doing well until after 
the learner has worked on the task for a significant amount of time. If a learner recog- 
nizes this, then frequent praise from the tutor might be regarded as insincere. This is 
an account that needs to be tested in other domains, where there are more
frequent opportunities to evaluate learner work. 



5 Implementing the Politeness Model 

Mattijs Ghijsen and Herwin van Welbergen have developed a politeness module that 
implements the politeness / motivation model described above, and interfaces to the 
natural language generator. The combined dialog generator takes as input the desired 
utterance type, language elements, and a set of parameters governing face threat miti- 
gation (social distance, social power, and motivational support) and generates an ut- 
terance with the appropriate degree of face threat redress. 

The utterance types are classified in accordance with Bloom’s taxonomy of educa- 
tional goals [4]. Bloom categorizes instructional actions into three groups: cognitive, 
concerning the development of intellectual abilities and skills; affective, comprising 
interests, attitudes, and values; and psycho-motor, regarding the manipulative or mo- 
tor-skill area. The three objectives most relevant to the VFTS are from the cognitive 
category: Knowledge, i.e. the recall of specifics, universals, methods and processes - 
such as mastering the concept of forecast, or the planning process; Application, i.e. the 
use of abstractions in concrete situations -for instance the application of a specific 
forecasting method to the simulated factory; and Synthesis, i.e. the putting together of 
elements and parts so as to form a whole -such as producing a plan of operations to 
perform on the VFTS interface. 

These cognitive goals, applied to the set of interface objects in the VFTS interface, 
and to the concepts and tasks described in the tutorial materials for the VFTS, deter- 
mine the set of possible communicative acts that the dialog generator needs to gener- 
ate. The repertoire of utterance patterns and language elements was extended as 
needed in order to cover this set. 

To choose the appropriate interaction tactic, the politeness generator first computes
target positive and negative politeness values for the desired utterance. The positive 
and negative politeness values are computed in order to counteract the weightiness of 
the face threat, as well as to achieve additional motivational influence, as indicated in 
the formulas in the previous section. Social distance, social power, and motivational 
influence are all parameters that are supplied to the politeness generator and are po- 
tentially adjustable. 

Once the target politeness values are chosen, the generator chooses from a library 
of natural language templates one that matches the target politeness values most 
closely. Each template, as in Figure 1, is assigned a positive and negative politeness 
value. A template is chosen that minimizes the sum of the distances between desired 
and chosen politeness values, for both positive and negative politeness. When multiple
templates have an appropriate politeness value, one is chosen that matches the
greatest number of move predicates. 
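
The selection rule can be sketched as follows. This is an illustration only: it assumes that "distance" means absolute difference, that templates tie when their summed distances are equal within a small tolerance, and that each template carries a set of predicate names. The Template class, the function name, and all politeness values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    text: str
    positive: float  # positive politeness value assigned to the template
    negative: float  # negative politeness value assigned to the template
    predicates: frozenset  # move predicates the template is labeled with


def choose_template(templates, target_pos, target_neg, wanted, tolerance=1e-9):
    """Pick the closest template; break ties by number of matching predicates."""

    def distance(t):
        return abs(t.positive - target_pos) + abs(t.negative - target_neg)

    best = min(distance(t) for t in templates)
    candidates = [t for t in templates if distance(t) - best <= tolerance]
    return max(candidates, key=lambda t: len(t.predicates & wanted))


# Invented library; texts and values are for illustration only.
library = [
    Template("Save the factory now.", 0.0, 0.0, frozenset({"suggest"})),
    Template("Do you want to save the factory now?", 0.25, 0.75,
             frozenset({"suggest", "hint"})),
    Template("Would you like to save the factory now?", 0.25, 0.75,
             frozenset({"suggest"})),
    Template("Why don't we save our factory now?", 0.75, 0.25,
             frozenset({"suggest"})),
]

chosen = choose_template(library, 0.25, 0.5, frozenset({"suggest", "hint"}))
print(chosen.text)
```

Here two question templates are equally close to the target values, and the one labeled with both requested predicates is selected.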

To apply this politeness module, it is necessary to assign politeness values to each 
template in the library. To assign these values, we grouped templates according to the 
politeness strategies that they exhibit, as shown in Table 1. These categories were 
derived from analysis of the background dialog transcripts, and then mapped onto the 
general strategies identified by Brown & Levinson such as bald-on-record. We then 
assumed that all templates employing a given strategy should be assigned the same 
level of politeness — not strictly true, since perceived politeness depends upon context, 
but adequate as a first approximation. We then created a questionnaire containing 
examples of different politeness strategies, and had subjects evaluate each example in 
terms of negative and positive politeness. The mean negative and positive politeness 
scores were then assigned to the templates in the library. 
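
The assignment procedure amounts to averaging the questionnaire ratings per strategy and copying the means to every template that uses that strategy. The sketch below shows this under invented data; the rating scale, the strategy labels, the template texts, and the function names are all assumptions.

```python
from statistics import mean

# Invented questionnaire data: for each strategy, one (positive, negative)
# politeness rating per subject, on a hypothetical 1-7 scale.
RATINGS = {
    "bald on record": [(1, 1), (2, 1), (3, 1)],
    "joint goal": [(5, 3), (4, 4), (6, 2)],
}

# Hypothetical template library: template text -> strategy it exhibits.
TEMPLATE_STRATEGY = {
    "Save the factory now.": "bald on record",
    "Why don't we save our factory now?": "joint goal",
}


def strategy_means(ratings):
    """Mean (positive, negative) politeness score for each strategy."""
    return {
        strategy: (
            mean([pos for pos, _ in scores]),
            mean([neg for _, neg in scores]),
        )
        for strategy, scores in ratings.items()
    }


def assign_politeness(template_strategy, ratings):
    """Give every template the mean scores of the strategy it employs."""
    means = strategy_means(ratings)
    return {text: means[strategy] for text, strategy in template_strategy.items()}


values = assign_politeness(TEMPLATE_STRATEGY, RATINGS)
print(values)
```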

Table 1. Face threat redress strategies for different utterance types

Utterance type               Politeness strategies
---------------------------  ------------------------------------------------
Suggest action               Bald on record, conventional indirectness, joint
                             goal, student goal, question, suggestion, tutor goal
Explain concept              Bald on record, positive politeness, attend to
                             hearer, student’s goal, impersonalize, off record
Explain tutorial             Bald on record, tutor goal, joint goal, suggestion
Suggest interface operation  Bald on record, conventional indirectness
Explain interface object     Bald on record
Socratic hint                Socratic hint
Action feedback              Bald on record, positive politeness

The politeness module and dialog generator are used as part of an overall peda- 
gogical agent architecture, which includes other modules to detect learner focus of 
attention, including eye gaze tracking, and to recognize plans that the learner is carry- 
ing out, as described in [22]. This information can be provided to a remote tutor, 
operating a Wizard-of-Oz interface, so that the tutor can decide when it is appropriate 
for the guidebot to interact with the learner; we are also developing a control module 
that can make these decisions automatically. Either way, the interactions are realized 
by the dialog generator, which selects appropriate dialog moves in accordance with 
the social distance and the intended effects on learner motivation. The synthesized 
utterances are then output by the agent persona, by means of a text-to-speech genera- 
tor and persona control system [24]. 
6 Related Work

There is a growing body of research relating to the expression of affect-related states
in animated agents (e.g., [6; 14; 21]). There also is some work in
recognizing user affect [18; 28]. However, following the work of theorists such as
Lazarus [11], we draw a distinction between emotions, on the one hand, and the attitudes
and mental states that engender emotions in certain circumstances, on the other, and we
focus on the latter. The focus here is to help make learners confident and motivated,
rather than to make them happy per se, although the expectation is that a favorable
motivational state will lead to favorable emotional states.

Although there is relatively little work on learning systems designed to detect and 
influence learner motivational state, the topic is beginning to attract more interest, 
particularly in the work of del Soldato et al. [26] and de Vicente [27]. Heylen et al.
[17] highlight the importance of affective and motivational factors in tutors, and ex- 
amine the interpersonal factors that should be taken into account when creating so- 
ciably intelligent computer tutors. Baylor [2] has conducted experiments in which 
learners interact with multiple pedagogical agents, one of which seeks to motivate the 
learner. User interface and agent researchers are also beginning to apply the Brown & 
Levinson model to human-computer interaction in other contexts [6; 15]. 

Porayska-Pomsta [20] has also been using the Brown & Levinson model to analyze 
teacher communications in classroom settings. Although there are similarities be- 
tween her approach and the approach described here, her model makes relatively less 
use of face threat mitigating strategies. This may be due to the differences in the so- 
cial contexts being modeled: one-on-one coaching and advice giving is likely to result 
in a greater degree of attention to face work. 



7 Conclusions and Future Work 

In this paper we have presented an approach to generating coaching dialogs charac- 
terized by politeness. Real tutors use politeness as a means for respecting the student’s 
social face, and for indirectly fostering his intrinsic motivation. Our goal is to replicate 
the tutor’s behavior by means of a natural language generator coupled with a polite- 
ness module, that chooses templates according to their level of politeness and to a set 
of pedagogical goals relevant to the VFTS. 

Using this framework we now plan evaluations to test the impact of politeness in 
learning settings. We plan to compare guidebots that exhibit politeness against guide- 
bots in which politeness considerations are disregarded, so that all guidebot comments 
are bald on record. We predict that the polite version will be regarded more favora- 
bly, and will result in an improved learner motivational state, compared to a compara- 
ble impolite version. 
Abstract. As computer systems evolve and come to be regarded as
social actors, social intelligence that enables natural and
socially appropriate interactions is attracting growing interest among
human-computer interaction researchers. This article discusses the definition,
importance, and benefits of social intelligence as agent technology. It then 
describes a collaborative learning system that incorporates agents that are 
equipped with a social intelligence model. We argue that socially appropriate 
affective behaviors provide a new dimension for collaborative learning systems. 
The system provides an environment in which learning takes place through 
interactions with a coaching computer agent and a co-learner, an autonomous
agent that makes affective responses. The social intelligence model that handles 
affective responses is based on psychological theories of personality, emotion, 
and human-media interaction, such as appraisal theory and the Media Equation. 
Experiments conducted with this collaborative learning system to examine the 
effect of the social intelligence model suggested that users had more positive 
impressions about the usefulness, the application, and their learning experience 
when the co-learner agent displayed social responses with personality and 
emotions than when it did not express them. It should be noted here that the co- 
learner agent did not provide any explicit assistance for the learner, such as 
giving clues and showing answers, yet it influenced the user’s evaluation of the
usefulness of the learning system. Experimental data also suggest that the co-learner
agent contributed to the effectiveness of the learning system.
1 Introduction

A recent new direction of research in human-computer interaction is to integrate 
sociality into human-computer interaction. As computer systems are becoming more 
and more ubiquitous, they are evolving from computing machines to communication 
machines. A speech-enabled automated system is an example of such machines. In 
this trend, computer systems are coming to be regarded as social actors rather than 
simply as tools. The ability involved with social aspects of interactions may be 
referred to as ‘social intelligence’, which contrasts with ‘rational intelligence’
[9]. ‘Rational intelligence’ refers to the kind of intelligence that artificial intelligence
has traditionally dealt with. According to Katagiri [9], it is goal-oriented intelligence
that enables one to find an answer to a question, generate a solution to a problem, and
so on. ‘Social intelligence’ is the type of intelligence that enables one to share
information and feelings with others, and behave in such a way that one is accepted as a
member of a community. 

Our research is concerned with the development and application of social intelligence 
for computer-assisted learning. Education, as traditionally interpreted through 
classroom-based school systems, is a social activity in many respects, particularly in 
that learning is mediated and influenced by the content and manner of social 
interactions between teachers and students, and among students. We argue that 
affective aspects of those social interactions can be utilized to make learning activities 
more effective and beneficial. In recent years, there has been a major shift in the 
paradigm of computer-supported learning. The research efforts on learning systems 
have traditionally been centered on individualized environments where a single user 
interacts with the computer system, which often serves as a teacher or a tutor. The 
new trend is an emphasis on collaborative learning environments [6, 8]. There are two 
kinds of benefits for learning in the collaborative learning environment. One is what 
is often called ‘learning by teaching,’ in which one can learn given knowledge by 
explaining it to another learner. The other benefit is often called ‘learning by
observation,’ in which one can learn given knowledge by observing another learner 
working on problem solving, teaching other learners, and so on. 

While in these approaches to collaborative learning, learning takes place in the 
interactions between the learners, the kind of intelligence these approaches are 
primarily concerned with is knowledge-based, goal-oriented, and rational, and thus 
social intelligence might only be utilized as a side effect. In contrast, our approach 
attempts to make use of affective and social intelligence in a more direct manner. 

In this article, we discuss the theoretical model of social intelligence and our 
implementation of it in a collaborative learning system. We then present a report of an 
experiment conducted to examine the effects of the social intelligence model in the 
learning system. 



2 Collaborative Learning System 

We have developed an agent-based learning system incorporating social intelligence 
for human-machine interaction. We call it the e-School system. In addition to the 
social intelligence model, a distinguishing characteristic of our system is the
introduction of an embodied co-learner agent. The co-learner, as its name indicates, is
learning alongside the human student, has no additional knowledge of the subject and no
explanations to provide, and lacks “ulterior” motives, such as eliciting nuanced
explanations from the student, that are present in some intelligent tutoring systems
(for example, see Goodman et al. [6]).

Embodied, interactive computer characters have been shown to be effective teachers by
increasing interest and transfer of knowledge [13], yet these studies have often focused
on the nature of one-to-one interaction between the teacher agent, assuming roles such
as tutor or coach, and the human learner. We argue that the presence of an embodied
co-learner character provides increased richness in the social interaction space, which
leads to increased learning.

As shown in Figure 1, our learning system employs a classroom metaphor, with the 
traditional chalkboard. There are three cartoon characters in the classroom: a teacher 
agent, a student agent, and an avatar for the human learner. In Figure 1, the teacher 
agent is represented by an owl, shown to possess wisdom attributes across cultures, 
the co-learner agent is represented by a tiger, and the human learner has chosen the 
representation of a cat. Interactions occur between the teacher and each learner, as 
well as between the learners. Furthermore, one type of interaction may influence the
other. For example, when the teacher gives the co-learner
feedback on its answer, the human learner may want to comment on it.

The agents are equipped with the social intelligence model that controls their
behaviors, including emotional expressions. The avatar’s behavior is partially
autonomous and can also be directly manipulated by the human learner. Each agent has its
own profile that defines the attributes of its personality and the role it plays (e.g.,
teacher, co-learner). The social intelligence model uses the profile information to
generate the agents’ behaviors. Although the user may only click on appropriate answers
and emotions, the agents and the avatar use facial expressions (see Figure 2) as well as
text output to communicate with each other. A chalkboard is available for the teacher
agent and the students to present learning materials. The current version (version 1) of
the e-School system has the functionality that meets the minimum requirements for
demonstration and the experiment reported below. The social intelligence model implemented in this
version is still simple, but is complex enough to study some affective factors such as
liking, emotion, and personality. Though not yet implemented in the current
system, in terms of the architecture, it is possible to place more agents and avatars in
the environment.



2.1 Theoretical Model of Social Intelligence 

In this section, we discuss the theoretical bases for the social intelligence model 
implemented in the e-School characters. Unlike some intelligent tutoring systems, e- 
School uses the character’s emotional models to engage the human learner, and 
motivate him or her to learn the content, rather than modeling his/her affective states 
and understandings. Appraisal theories of emotion and most emotion theorists in 
general agree that emotion is essentially a reaction to events deemed relevant to the 
needs and goals of an individual [3]. At the most basic level, the achievement of goals 
leads to happiness, and the failure to achieve goals leads to sadness: this is the current 
level employed in our system. 

Various versions of appraisal theory exist which specify in greater detail the 
critical properties of a goal-relevant event that lead to specific emotions (e.g., 
frustration, anger, hope, etc.). Goals can take many forms, but Maslow’s Hierarchy 
provides a useful way of categorizing these goals. In his later work, Maslow [12] 
grouped an individual’s basic needs into eight categories: 

Physiological: hunger, thirst, bodily comforts, etc.

Safety/security: being out of danger

Social: affiliate with others, be accepted

Esteem: to achieve, be competent, gain approval and recognition

Cognitive: to know, to understand, and explore

Aesthetic: symmetry, order, and beauty

Self-actualization: to find self-fulfillment and realize one's potential

Transcendence: to help others find self-fulfillment and realize their potential.

In the e-School system, we focus on social and esteem goals. We also arguably take 
cognitive goals into account, but we do not distinguish these from esteem goals. A 
strong argument can be made that social, esteem, and cognitive goals are the most 
appropriate for artificial agents given that computer agents do not have true physical 
form (so do not need to worry about food or safety issues) and do not have enough 
intelligence to be concerned with aesthetics or higher level goals. 

Unlike appraisal theory, current 2-factor theories of emotion [10] do not focus on 
what causes emotion, but instead argue that all emotions can be fully described using 
two dimensions: valence (positive/negative) and arousal (low/high). The e-School 
social intelligence can be seen to employ 2-factor theory, in addition to appraisal 
theory, since our current emotion model includes valence (as a dichotomous variable) 
and arousal (as a continuous variable). 

Brave [2] extends appraisal theories of emotion to develop a new theoretical 
framework for empathy. The basic idea is that when we care about another person 
(i.e., value their welfare), we may take on some of their goals as if they were our own. 
Such “adopted” goals may now lead to emotion just as any of our own personal goals.
When we react emotionally to adopted goals, it is called “empathy.” 

To illustrate, let us assume that student #1 self-formulates the goal of answering a 
question correctly. Based on appraisal theory, this student will become happy if 
successful and sad if not. If a second student (student #2) is a good friend of student 
#1, then student #2 will also want student #1 to get the answer correct; in other words, 
student #2 will “adopt” student #1’s goal because he/she cares about student #1. Now
student #2 may also become happy or sad depending on whether student #1 answers 
correctly or incorrectly, an empathic response. Although not discussed in the 
literature, it is reasonable to imagine that the intensity (arousal) of the emotion 
empathically experienced as such is related to the degree of caring or friendship for 
the target of empathy. In the e-School environment, we can have the co-learner agent 
behave as if it had empathy toward the human learner, displaying that it cares about 
him/her. It would then follow that, if these behaviors are displayed appropriately, they
could help the human learner achieve his/her goals.
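
The student example can be sketched in code under loudly stated assumptions: the function names are hypothetical, intensity is taken to equal goal importance, and empathic intensity is scaled linearly by a caring value in [0, 1]. The text only says intensity is "related to" caring, so the linear form is an illustration, not the e-School rule.

```python
from typing import Tuple


def appraise_own_goal(goal_achieved: bool, importance: float) -> Tuple[str, float]:
    """Own-goal appraisal: success gives 'happy', failure gives 'sad'.

    Intensity is assumed to equal the importance of the goal.
    """
    emotion = "happy" if goal_achieved else "sad"
    return emotion, importance


def appraise_adopted_goal(
    goal_achieved: bool, importance: float, caring: float
) -> Tuple[str, float]:
    """Empathic appraisal of a goal adopted from someone we care about.

    The emotion is the same as for an own goal; its intensity is scaled
    by `caring` in [0, 1] (a linear scaling chosen only for illustration).
    """
    if not 0.0 <= caring <= 1.0:
        raise ValueError("caring must lie in [0, 1]")
    emotion, intensity = appraise_own_goal(goal_achieved, importance)
    return emotion, intensity * caring


# Student #1 answers correctly; student #2 cares moderately about student #1.
own = appraise_own_goal(True, importance=0.8)
empathic = appraise_adopted_goal(True, importance=0.8, caring=0.5)
print(own, empathic)
```

With `caring=0.0` the adopted goal produces no arousal at all, which matches the intuition that empathy requires valuing the other person's welfare.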

Interpersonal attraction (including liking and affiliation) is considered to be a
fundamental component of social psychology. The two theories of attraction employed in the
e-School social intelligence are the reinforcement-affect model [4] and balance theory [7].
According to the reinforcement-affect model, likes and dislikes are based on whether 
positive or negative feelings are aroused. We like any rewarding stimulus because of 
our positive feelings, and we dislike any punishing stimulus because of our negative 
feelings. When a stranger (or any other neutral stimulus) is present at the same time, 
that person becomes associated with the positive or negative feelings. This 
conditioning results in liking for any stranger associated with positive feelings and 
dislike for any stranger associated with negative feelings. This reinforcement-affect 
model is strongly related to the “don’t kill the messenger” phenomenon [14], in which
the receiver of a message often orients emotionally toward the most proximal
source of a message (i.e., the message deliverer) rather than the true source of the 
message, who may be at a distance. 

Combined with emotion appraisal theory, the reinforcement-affect model can also be 
used to explain the phenomenon of liking reciprocity [1, 5], which says that we tend 
to like people who like us. Being liked meets our social goals, which according to 
appraisal theory makes us happy. The reinforcement-affect model then says that we 
will like the person that made us happy, and so we get reciprocal liking. 

Balance theory of Heider [7] extends this notion of liking reciprocity to groups of 
three (triadic reciprocity). If person #1 likes person #2, then person #1 tends to like 
those things (or people) that person #2 likes. If, on the other hand, person #1 dislikes 
person #2, then person #1 tends not to like those things (or people) that person #2 
likes. Effects can also occur in the reverse direction: if two people like the same thing 
(or person), they tend to like each other. If, on the other hand, they disagree about 
some thing (or person) they tend to dislike each other. 
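
The triadic rules above are commonly formalized with signed attitudes, where a triad is balanced when the product of its three signs is positive. The sketch below uses that sign-product formalization as an assumption; the function names are hypothetical and it is not claimed to be the e-School implementation.

```python
LIKE, DISLIKE = 1, -1


def is_balanced(p_to_o: int, o_to_x: int, p_to_x: int) -> bool:
    """A triad (person P, other O, thing or person X) is balanced
    when the product of the three signed attitudes is positive."""
    return p_to_o * o_to_x * p_to_x > 0


def predicted_attitude(first: int, second: int) -> int:
    """Sign of the third relation that balances a triad, given the other two."""
    return first * second


# Person #1 likes person #2, who likes X: person #1 tends to like X.
print(predicted_attitude(LIKE, LIKE))
# Person #1 dislikes person #2, who likes X: person #1 tends not to like X.
print(predicted_attitude(DISLIKE, LIKE))
# Reverse direction: two people who disagree about X tend to dislike each other.
print(predicted_attitude(LIKE, DISLIKE))
```

Because the rule is symmetric in the three relations, the same function covers both the forward direction (inferring an attitude toward X) and the reverse direction (inferring liking between two people from their attitudes toward X).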

Although caring (as discussed above in relation to empathy) can be considered a 
social-psychological construct distinct from liking, the two are often very highly 
correlated. As such, a modeling approximation used in the e-School intelligence is to 
consider them the same dimension. 

3 Experiment

This section presents the experiment that we conducted to examine the effect of the 
social intelligence model in the collaborative learning system described above. For 
this initial testbed, our domain was English idioms for intermediate learners of the 
language, in particular, Japanese students. The lesson consisted of fifteen questions 
across three levels of difficulty. Questions were of the single-correct or multiple- 
choice format, drawn from standard English idiom teaching materials, such as 
“choose all the answers that apply,” “choose the correct answer,” and “choose the 
incorrect interpretation.” Instructions for answering each kind of question were 
clearly presented before each section. 

The entire lesson format followed a sequence where the teacher agent presents a 
question, the learners choose a response from available choices, the teacher provides
feedback on the response and concludes with further explanations on the question. For 
example, a sample question within the “easy” category in this format was presented 
as: 

[Instructions:] Teacher Agent Says: 

“What do you think the following statement means:” 

[Statement:] “Lucy is a pain in the neck!” 

[Chalkboard displays answer choices]: 

Possible Interpretations: 

a) “Lucy hurt her neck.” 

b) “Lucy makes my neck hurt.” 

c) “Lucy is annoying.” 

d) I don’t know. 

[Learners choose their answer] 

Correct Answer: (c) 

[Teacher grades the answer:] Teacher Agent Says: 

“That is correct/incorrect” 

[Follow up Explanation:] Teacher Agent Says: 

“Calling someone a pain in the neck is not very nice - and 
something people tend to say only when the “pain in the neck” person is 
not present. It means they are annoying or troublesome to you, perhaps 
because they ask too many questions or put too many demands on your 
time.” 



3.1 Method 
Participants: 

A total of seventy-seven (77) undergraduate students at International Christian 
University (ICU) participated in the experiment. All participants were native speakers 
of Japanese learning English. 

Design:

Three versions of the application were prepared for the experiment: 

a) No co-learner agent (No Agent Condition)

On the interface, there were only the teacher agent and the avatar. 

b) Co-learner agent without social intelligence model (No Social Model 
Condition) 

On the interface, there were the teacher agent, the co-learner agent, and the
avatar. The co-learner agent did not have social intelligence and as such it
did not display any emotional or social responses, having what is often
referred to as a neutral or ‘poker face.’

c) Co-learner agent with social intelligence model (Social Model Condition) 

On the interface, there were the teacher agent, the co-learner agent, and the
avatar. The co-learner agent was controlled by the social intelligence model.
It had a friendly and mildly dominant or confident personality, and graphically
displayed emotional and social responses depending on the human learner’s
performance in class.

Procedure: 

The experiment was conducted in ICU’s language lab, where all the participants 
participated in the experimental session together. Each participant was randomly 
assigned to one of the conditions. The participants first received the instructions on 
how to work with the application verbally and in writing. Following the
self-enrollment procedure, the participants had the experimental session of learning
English idiomatic expressions using the application. The session was conducted in the
following way. For each problem, the teacher agent presented a question, showing it
on the blackboard. Then the human learner made a response by choosing one of the
alternatives. At this point, the response was not disclosed to the teacher or the
co-learner. In other words, the learner simply thought of an answer in her mind. The
teacher then called on either the human learner’s avatar or the co-learner agent to answer.
The teacher agent then gave feedback (positive or negative). If the human learner had
answered, the co-learner agent reacted to the situation. If the co-learner agent
had answered, the human learner was given a few possible reactions to choose from.
Then, the teacher agent gave a brief explanation of the expression independent of
the student’s choice; that is, whether the student answered correctly or incorrectly, s/he
heard and saw the idiom explanation. Thus, every participant was privy to the same
amount of information, presented in the same order, independent of the correctness of
his or her answers.

The relationship between the human learner and the character co-learner is a
peer-to-peer one, without the hierarchy implied by the additional knowledge or
mastery of the language that the practice coaches in other character-based language
learning systems possess (see, for example, Extempo’s FLOW system [11]).

After the teaching session was over, the participants responded to a questionnaire that
was presented as a web page and displayed in the computer’s internet browser. It included
questions concerning impressions of the application, the co-learner agent, and user
experience. A short quiz, also online, followed to measure the participants’
performance on recall and recognition of the idiomatic expressions presented in the
lesson; it consisted of thirty true/false and eleven fill-in-the-blank questions. The
experiment took approximately an hour for each participant.

3.2 Results and Discussion

Quiz Results and Self-assessment of Learning 

The effectiveness of the application was first examined based on the results of the
content quiz, which are shown in Table 1. The participants in the two co-learner
conditions (Social Model and No Social Model) attained higher rates of correct
answers on the forty-one content questions than those in the no-agent condition.
Furthermore, students in the Social Model Condition left fewer questions
blank than students in the other two conditions. These results seem to indicate that the
co-learner produced greater retention of the lesson, and that participants were more
motivated to answer the questions, even guessing their response, when they worked
with the co-learner agent with the social intelligence model.

In addition to the content quiz, the post-experience questionnaire assessed how
much (on a 10-point scale) the participants felt they had learned the materials. The
results are shown in Table 2. The statistical analysis showed that the participants’
self-assessment of their learning was significantly higher for the co-learner conditions
(Social Model and No Social Model) than for the No Co-learner Condition [F(1, 76) =
5.32, p < .05]. This pattern of results indicates that the participants had more positive
impressions about their experience and the content of learning when they worked with
a co-learner agent.



Table 1. Quiz Results (Ratio to the total number of responses)

Condition                                      Percentage of        Percentage of
                                               Correct Responses    Responses left blank
No Co-learner                                  51%                  27%
Co-learner without Social Intelligence Model   60%                  22%
Co-learner with Social Intelligence Model      69%                   7%



However, the difference between the two co-learner agent conditions (Social
Model and No Social Model) did not turn out to be significant [F(1, 76) = .26, n.s.].
Does this mean that the social intelligence model had no effect on the students’ 
learning experiences? To examine this question, further analyses were performed on 
other aspects of the participants’ learning experiences and impressions. 



Table 2. Self-assessment of Idiom Learning

        No Agent    Agent (No Social        Agent (Social
                    Intelligence Model)     Intelligence Model)
Mean    3.86        4.96                    5.28
SD      2.08        2.34                    2.31



Evaluation of Co-learner Agent 

We first tested whether the participants’ impression of the co-learner’s characteristics
(evaluated on a 10-point scale) varied depending on the presence or absence of social
responses. The analysis showed that the Social Model Condition scored significantly
higher than the No Social Model Condition on the following evaluation items:

(1) Cooperativeness [F(1, 51) = 13.82, p < .01]

(2) Trustworthiness [F(1, 51) = 8.56, p < .01]

(3) Feels warm [F(1, 51) = 6.6.5, p < .02]

These results showed that the social intelligence model and the behaviors generated by it
had a more positive influence on the impressions of the co-learner agent.



Evaluation of Learning System

As stated above, the purpose of this study was to investigate the effect that the 
affective aspects of social intelligence may have on the usefulness and impression of 
the learning system. The question we should ask is how those impressions and 
evaluations that the participants had about the agents influenced the evaluation of the 
application as a whole. In other words, how could the social intelligence model affect 
the evaluations of the characters, and in turn have an effect on the evaluation of the 
learning system itself? 

Before addressing this question, there is one factor that we should take into
consideration. The analysis indicated that, under both the Social Model Condition
and the No Social Model Condition, the impression of the application depended
on how attracted the participant was to the agent [F(1, 51) = 12.16, p < .01]. This
means that regardless of the presence of social intelligence, liking varied from person
to person. That is, some individuals liked the agent without social responses more
than others and some liked the agent with social responses less than others, depending
on their personal preferences. Similar results have been observed with characters
based on personality similarities and differences, indicating the possibility that this
preference may be generalizable based on each participant’s personality
characteristics (see Reeves and Nass [14]). To control for this personal preference
effect in the analysis of the effect of the social intelligence model, an analysis of
covariance was performed with personal preference for the agent as a covariate. The
result revealed a main effect of the social model on the evaluation of the
application for the following items:

(1) Ease of use [F(1, 51) = 6.11, p < .02]

(2) Satisfaction (Would recommend it to others) [F(1, 51) = 4.82, p < .04]

(3) Pleasantness in learning (Not frustrated) [F(1, 51) = 6.62, p < .02]

(4) Supportiveness [F(1, 51) = 3.28, .05 < p < .08]

For these evaluation items, the Social Model condition marked higher scores than the 
No Social Model condition. 

These results seem to suggest that the social intelligence model and the social
responses (i.e., friendly and confident) generated by the model had a positive impact on
the impressions of the learning system as a whole.

Finally, some may argue that it was not the model-based affective responses that
had an impact on the participants. In other words, as long as they are friendly, random and
superficial responses could do as well as the model-controlled social responses. The
present experiment does not offer a definitive answer to this question, since the study
compared the agent with the social intelligence model and the one without. However,
a closer analysis of the No Social Model Condition revealed a potential mismatch
between the anthropomorphism suggested by the animated appearance of the
co-learner and the lack of socially appropriate responses. This mismatch in the No Social
Model Condition could violate the participant’s expectations and lead to negative
impressions. The analysis of covariance with the system impression as a covariate
showed that the participants felt it was significantly less enjoyable to work with the
system in the No Social Model Condition, when compared both with the Social
Model Condition and the No Agent Condition [F(1, 76) = 5.25, p < .03]. This result
seems to indicate that the lack of appropriate affective responses of an embodied
agent is worse than the absence of an agent. In light of this finding, we would argue
that participants may not respond to the superficial level of affective responses of an
agent (i.e., randomly generated affective responses), but rather show greater sensitivity
to the naturalness of such response patterns. In that regard, the social intelligence
model plays an important role in the interaction.

In summary, the results of the experiment provided support for the following 
assertions: 

(1) The co-learner agent contributed to the effectiveness of the learning
system.

(2) The users had more positive impressions about the usefulness of the
application and the learning experience when the co-learner agent displayed
social responses indicating personality, emotions, and awareness of the 
human learner’s emotions (empathy). 

It should be noted here that the co-learner agent in the current system did not
provide any explicit assistance for the learner such as giving clues and showing
answers. It only made some limited social responses such as praise, encouragement,
and comfort (typical responses include “that was a hard question!”). Yet, the
participants’ evaluation of the usefulness of the learning system was influenced by
such an agent. These results are consistent with the claims that Media Equation theory
[14] makes. That is, even if the co-learner is an artificial entity like a software agent
with simple animation, a social relationship will be inferred by the human user, and
social-psychological behaviors of the user can be influenced according to the nature
of the interactions that take place within that relationship.



4 Conclusion 

In this article, we discussed the social intelligence that renders affective behaviors of
software agents and its application to a collaborative learning system. We argued that
socially appropriate affective behaviors would provide a new dimension for
collaborative learning systems. We presented an experiment that tested our
hypothesis using our implementation of the collaborative learning system. The
experimental data seem to support the view that social and affective agents have great
potential to enhance learning.

As noted above, our system is at an initial stage and the social intelligence model
is still rather simple and limited. It is encouraging that even such a simple model
could generate significant effects on certain aspects of the user experience and
achievement of the goal (i.e., English idiom learning). Needless to say, there are more
challenges and issues with the development of and research on social intelligence.
First, the social intelligence model should be further enhanced. To that end,
a better understanding of social intelligence, especially in terms of modeling, must be
gained. Second, further studies are needed to substantiate our claims on the social and
affective aspects of collaborative learning in education. There are probably some 
universal characteristics about the nature of interactions and collaborations across 
different educational domains beyond foreign language learning, but there must also 
be differences depending on the educational domain and on the educational goals of 
the application. Different pedagogies may result in applications as widely varying 
within the same domain as those across domains. The better we understand the 
dynamics of social interactions, the better the social model can be. In that regard, 
collaborations among researchers in the related fields such as human-computer 
interaction, artificial intelligence, psychology, and education should be sought out. 
Our team is certainly growing in that direction, and we expect to make more progress 
in the near future. 

Abstract. Despite the growing attention towards the communication adequacy
of embodied conversational agents (ECAs), standards for their assessment are
still missing. This paper reports on a methodology for the evaluation of the
adequacy of facial displays in the expression of some basic emotional states,
based on a recognition task. We consider recognition rates and error distribution,
both in absolute terms and with respect to a human model. As to data
analysis, we propose to resort to standard loglinear techniques and to
information-theoretic ones. Results from an experiment are presented and the potential
of the methodology is discussed.



1 Introduction 

In recent years there has been a great effort in the development of embodied
conversational agents (ECAs), i.e., artificial agents able to communicate by means of
non-verbal behaviour (gestures and facial displays) in addition to voice, and to generate
emotional and conversational behaviour as a function of communicative goals and
personality [1]. However, despite their growing popularity, standards for ECA
assessment are still missing. Some of the reasons for the lack of a common evaluation
framework for ECAs can be found in their variety and complexity. They target a large
variety of domains and applications (e.g. multimodal human-computer interfaces,
interactive games, tools for the hearing impaired, ...), serving many different purposes
(virtual guides, salesmen or companions, aids to improve the intelligibility of natural
and synthetic speech, to enhance the visual appearance of an interface, etc.). It seems
natural that their effectiveness be measured in terms of communication abilities more
than (or besides) standard usability indexes; but it is still not clear how this can be
accomplished, and to what extent this can be done in a task-independent way, to
achieve generality. On the other hand, it is possible to single out different levels,
including the appearance of the ECA, its ‘mind model’, the intelligibility of the gestures
and emotional displays it produces, etc., which are expected to contribute to the
ECA’s effectiveness [2].

In this paper we focus on the evaluation of a specific aspect of the emotional
behaviour of talking faces, i.e. the adequacy of facial displays in the expression of some
basic emotional states, measuring it through a recognition task. Though quite a
low-level (perceptual) assessment, it can provide important insights to developers,
e.g., by allowing them to compare the performances of different faces based on
different technologies, or of the same face at different stages of development. In
addition, the recognizability of emotional expressions affects communicative behavior;
hence its assessment is expected to be an important step towards a more
comprehensive evaluation of the communicative abilities of synthetic agents.

We will start by discussing the evaluation procedure for MPEG4 facial animation 
players suggested by Ahlberg, Pandzic and You [3]. In §3 we attempt to improve the 
methodology, and introduce the details of a recognition experiment we conducted. We 
then discuss the data analysis and some of the results in §4, focusing on correct recog- 
nitions and on the way errors distribute across conditions. Section 5 tries to take stock 
of the work done, and to highlight interesting directions for future work. 

2 The Linköping Approach

Ahlberg et al. [3] suggested a benchmark for measuring the subjective quality of a
face model in terms of how well it can convey emotions (via facial expressions). They
were interested in investigating how well a given face model can express emotions
when controlled by low-level MPEG-4 FAPs (Facial Animation Parameters) captured
from real persons acting out the emotions. The expressiveness was measured by the
accuracy rate of human observers recognizing the facial expression being shown.
Subjects were given sequences of videos of human and synthetic faces expressing the
6 basic emotions of Ekman’s set (fear, anger, surprise, sadness, happiness, disgust)
[4]. Stimuli consisting of human faces were constructed by recording people acting
out different emotions with a video camera (natural video sequences). During the
recordings, the 3-D motion of the head and a subset of 22 MPEG-4 facial feature points
were tracked through head tracking equipment and IR-reflecting markers. This
allowed the creation of MPEG-4 FAP files, which were then fed into two different facial
animation engines to produce the synthetic video sequences.

The hypothesis was that the recognition rate for each synthetic face is better than
the random case but worse than the human. In addition, the authors wanted to compare
synthetic faces. The dispersion matrices containing the frequencies of the given
judgments for each face were compared to an ideal dispersion matrix (perfect recognition)
and a random matrix (random recognition). Absolute and relative scores for faces
were provided, and the t-test was used to check for statistical significance of
differences. The results showed significant differences between the models and the real
(natural), ideal and random case; however, significant differences between the two
face models were not detected.

It is an important feature of the proposed methodology that the expressiveness of
MPEG-4 facial animation players is compared (also) to that of the humans who
provided the model: since people make recognition errors when confronted with human
facial expressions, a data-driven approach to face animation is expected to be capable
of reproducing both the correct and incorrect recognitions of the model.

There are some limitations in the way the study was performed, though. Some are
related to the reflective dots procedure used to record the data, which is very time- and
effort-demanding, and requires that several constraints be met for the data to be
reliable. Problems of this kind yielded some videos of low quality, which had to be
discarded, producing a different number of video stimuli for each emotion and each
human model. Other possible sources of problems are: the choice of laymen as actors,
because of the low consistency and uniformity of the resulting expressions; the
fact that some videos were of different duration, which introduced another disturbing
factor; the resort to collective rather than individual sessions with subjects.
Finally, the suggested method provides an easy way to compute absolute and relative
scores. However, some technical details are not clear (one might object to the
appropriateness of a t-test for testing significant differences between scores; the
reason for using a randomly generated matrix, instead of the more standard
procedure of making hypotheses on the form of the distribution, is not clear). More
importantly, the suggested measure is quite a rough one, for it collapses all the
information of confusion matrices into a single score. In particular, correct recognitions and
errors cannot be told apart, nor is it possible to speculate about the different
recognisability of the various emotions. Yet, especially for comparison and development
purposes, it would be important to have finer-grained analyses addressing both
dimensions, e.g., to understand how closely the data-driven method (based on FAPs) mimics
the human model on each of them.
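
The loss of information can be illustrated with a small sketch. The two confusion matrices below are invented for illustration (three emotions, ten judgments each), and the overall recognition rate stands in for "a single score"; it is not claimed to be the scoring formula used by Ahlberg et al.

```python
from typing import Dict, List

Matrix = List[List[int]]  # rows: presented emotion, columns: chosen label


def overall_rate(matrix: Matrix) -> float:
    """Share of all judgments that fall on the diagonal (one summary score)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_emotion_rates(matrix: Matrix) -> List[float]:
    """Recognition rate for each presented emotion, computed row by row."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def error_distribution(matrix: Matrix, presented: int) -> Dict[int, int]:
    """Wrong labels (column index -> count) chosen for one presented emotion."""
    return {j: n for j, n in enumerate(matrix[presented]) if j != presented and n > 0}


# Hypothetical data: both faces reach the same overall score...
face_a = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
face_b = [[10, 0, 0], [0, 10, 0], [3, 3, 4]]
print(overall_rate(face_a), overall_rate(face_b))
# ...but they differ sharply per emotion and in where the errors go.
print(per_emotion_rates(face_a), per_emotion_rates(face_b))
print(error_distribution(face_b, presented=2))
```

Both matrices place 24 of 30 judgments on the diagonal, yet the second face fails mostly on one emotion, which is exactly the kind of difference a single score hides.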

3 The Experiment: Objectives and Method 

For our experiment we modified the objective and methodology described in the pre- 
vious section. In particular, we attempted to improve the experimental design and the 
data analysis models, paying specific attention to correct recognitions and errors. As 
to data analysis, we exploited standard techniques for the analysis of categorical data 
(generalised linear models, loglinear analysis [9]) and an information theoretic ap- 
proach to error analysis [10]. 

3.1 Objectives 

The experiment aimed at evaluating the expressiveness of two synthetic faces in two
different animation conditions: in the first, FAP files recorded from an
actor were played; in the second, the FAP files were produced from scripts specified
by the developer. We will refer to the two conditions as the FAP and the script-based
(SB) one, respectively. The use of SB is motivated by the existence of many talking
heads that exploit this mode of animation. On the other hand, FAP files directly drawn
from actors could be expected to give better results in terms of adherence to the
model. In the data analysis we considered both the absolute merits of a given
face*mode-of-animation combination and its quality relative to the human model. One
of the objectives was to assess how close FAP-faces are to the model, and what kind
of biases and idiosyncrasies, if any, the SB mode could be responsible for. Finally, we
tackled these goals by considering both recognition rates and the error distributions.

3.2 Method 

We departed from Ahlberg et al.’s methodology in the following respects: the
neutral expression was added to the six emotions of Ekman’s set; another
animation condition, the aforementioned SB one, was added; recordings (and FAP files) were
taken from one professional actor instead of laymen;¹ great attention was paid to
recording conditions, so that we didn’t have to discard any recordings, managing to have
the same number of stimuli for each emotion and condition.

We used two synthetic faces (Face1 and Face2); each was presented to subjects
and evaluated in two different conditions: the face playing script-based (SB)
emotional expressions, and the face playing the FAP files extracted from the actor. This
setup allowed us to: a) compare SB approaches, whereby the specifications
for expressing emotions are provided by the developer, with data-driven (FAP)
ones, b) pursue the task of cross-face comparison, and c) assess possible interactions
between faces and mode of animation.

3.2.1 Video Stimuli 

Preparation of videos went through three steps: recording of an actor uttering a sen- 
tence while expressing different emotional states, production of the related MPEG-4 
FAP (Facial Animation Parameters) files, and animation of the FAPs sequences using 
different synthetic faces. 

The actor (male, 30 years old) was recorded through the Elite system [4], which
uses two cameras with a frame rate of 100 Hz to capture 28 markers. Two synthetic
3D face models were used in the study, Face1 [5] and Face2 [6], both implementing the
MPEG-4 Facial Animation (FA) standard. The point trajectories obtained from the
motion tracking system were converted into FAP streams through TRACK [8], and
normalized according to the MPEG-4 FA standard to make them speaker-independent. The
FAP streams were then used to animate the synthetic faces to produce the FAP
condition videos through screen capture.

The video camera recordings of the actor were digitized and edited to be used for
the Actor condition of the experiment. Finally, the script-based condition consisted of
recordings obtained by playing the relevant scripts.

3.2.2 Experimental Design 

A within-subjects design was adopted: subjects were presented with 3 blocks
(ACTOR, FACE 1 and FACE 2) of 14 video files each, yielding a total of 42
judgments per participant.² The animation conditions (SB and FAP) were appropriately
randomized within the two blocks of synthetic faces.
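
The stimulus counts can be checked by enumerating the design. The sketch assumes one reading that is consistent with the stated numbers: 7 expressions (Ekman's six plus neutral) crossed with two series per block, i.e. two actor series in the ACTOR block and the FAP and SB conditions in each face block. The identifiers are illustrative.

```python
# Seven expressions: Ekman's six emotions plus 'neutral'.
EMOTIONS = ["anger", "happiness", "neutral", "disgust", "surprise", "fear", "sadness"]

# Assumed composition of each block (two series of seven videos each).
BLOCKS = {
    "ACTOR": ["ACTOR1", "ACTOR2"],  # two series of videos from the same actor
    "FACE1": ["FAP", "SB"],         # data-driven vs. script-based animation
    "FACE2": ["FAP", "SB"],
}

# Every (block, series, emotion) combination is one video stimulus.
stimuli = [
    (block, series, emotion)
    for block, series_list in BLOCKS.items()
    for series in series_list
    for emotion in EMOTIONS
]

per_block = {block: sum(1 for s in stimuli if s[0] == block) for block in BLOCKS}
print(per_block)      # videos per block
print(len(stimuli))   # judgments per participant
```

Under this reading each block holds 14 videos and each participant gives 42 judgments, matching the figures in the design description.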

As to emotional expressions, the videos covered the 6 emotions from Ekman’s set
plus ‘neutral’. Each emotional state was expressed by the faces while uttering the
Italian phonetically rich sentence “Il fabbro lavora con forza usando il martello e la
tenaglia” (The smith works with strength using the hammer and the pincers). The audio
was not presented. The task of the subjects was to identify the presented expressions
by choosing from a list of the 7 available emotional states.

¹ The actor was male, while the faces we used were both female. It would have been
interesting to control for cross-sex portability of FAP files, but this was not possible at the time
the experiment was performed. We plan to address the issue in future studies.

² The ACTOR block consisted of presentations of two series of videos from the same actor,
called ACTOR1 and ACTOR2. This was done to control consistency of results with respect
to the actor. Since no differences emerged, in the following our comparisons and discussion
will be limited to ACTOR1, with the exception of §4.2, where both types of data from the
actor are used again.

3.2.3 Procedure 

Subjects were 30 unpaid volunteers (15 males and 15 females) recruited at ITC-Irst. 
None was involved in the present project. They were given individual appointments 
for the experimental session in the recording lab (a silent room), and were 
individually tested. Before the experimental session they were given written instructions 
and went through a short training session to familiarize themselves with the task. The 
training session used 4 video files for each of the three faces (12 stimuli in total), 
with stimuli different from those used in the experimental session. The video files 
(320x360, AVI file, Indeo-5.10 compression) were presented on the computer screen 
through Microsoft PowerPoint®. Each video file was presented only once. Each block 
had three different presentation orders, which were randomly created and balanced 
across conditions and participants. The presentation order of the three blocks was also 
balanced across participants. 
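
The balancing described above can be illustrated with a small sketch. The text does not say how orders were assigned to participants, so the round-robin scheme below, the function name `build_schedule` and the seed are all assumptions made for illustration; only the figures (30 participants, 3 blocks, 14 videos per block, 3 presentation orders per block) come from the text.

```python
import itertools
import random
from collections import Counter

BLOCKS = ("ACTOR", "FACE1", "FACE2")
N_PARTICIPANTS = 30
N_VIDEOS = 14   # video files per block
N_ORDERS = 3    # presentation orders per block


def build_schedule(seed=0):
    """Return, per participant, a list of (block, order_index, video_order)."""
    rng = random.Random(seed)
    # Three randomly created presentation orders for every block.
    orders = {b: [rng.sample(range(N_VIDEOS), N_VIDEOS) for _ in range(N_ORDERS)]
              for b in BLOCKS}
    block_sequences = list(itertools.permutations(BLOCKS))  # 6 block sequences
    schedule = []
    for p in range(N_PARTICIPANTS):
        cycle, slot = divmod(p, len(block_sequences))
        # Hypothetical rule: rotate the order index so each one is used equally often.
        idx = (cycle + slot) % N_ORDERS
        schedule.append([(b, idx, orders[b][idx]) for b in block_sequences[slot]])
    return schedule


schedule = build_schedule()
sequence_counts = Counter(tuple(b for b, _, _ in s) for s in schedule)
print(sequence_counts)
```

With 30 participants, each of the 6 block sequences is used by 5 participants and each presentation order of a block by 10, which is one way to obtain the balance mentioned in the text.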

The experimental session started immediately after the training session. Participants 
were asked to watch the video files and express their judgement on a paper form, 
choosing from among the 7 available labels for emotional states (corresponding to the 
7 presented emotional expressions). At the end of the experimental session, they were 
given a 4-item questionnaire, to collect their impressions of the faces. 



4 Results 

4.1 Correct Recognitions 

Correct recognitions were analyzed by dichotomizing the responses (correct vs. 
wrong). Table 1 reports the results (recognition rates, RR) for each emotion and con- 
dition, and (last row) the mean recognition rates per condition across all emotions. 



Table 1. Percentages of correct recognitions for each emotion and condition. 

             ACTOR1   F1-FAP   F1-SB   F2-FAP   F2-SB
anger          90%      27%     53%      7%      23%
happiness      97%      80%     40%     80%      77%
neutral        70%      70%     60%     53%      67%
disgust        13%      20%     53%     17%      17%
surprise       47%      40%     87%     33%      90%
fear           50%      17%     77%      0%      77%
sadness        17%       7%     97%      7%      97%
All            55%      37%     67%     28%      64%
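
As a quick consistency check, the last row of Table 1 can be recomputed as the unweighted mean of the seven per-emotion rates in each column. The sketch below only restates the table values; the names `TABLE1` and `mean_rate` are illustrative.

```python
# Per-emotion recognition rates (%) from Table 1, in the order:
# anger, happiness, neutral, disgust, surprise, fear, sadness.
TABLE1 = {
    "ACTOR1": [90, 97, 70, 13, 47, 50, 17],
    "F1-FAP": [27, 80, 70, 20, 40, 17, 7],
    "F1-SB":  [53, 40, 60, 53, 87, 77, 97],
    "F2-FAP": [7, 80, 53, 17, 33, 0, 7],
    "F2-SB":  [23, 77, 67, 17, 90, 77, 97],
}


def mean_rate(rates):
    """Unweighted mean recognition rate across emotions, in percent."""
    return sum(rates) / len(rates)


for condition, rates in TABLE1.items():
    print(f"{condition}: {mean_rate(rates):.1f}%")
```

Rounded to whole percentages, the means agree with the "All" row (55, 37, 67, 28 and 64).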



We ran a (multinomial) model selection analysis [9], which showed that the full model 
(including the main effects for each independent variable, and all the second- and 
third-order interactions) was needed to adequately fit the data. Based on this, we then 
developed a multinomial logit model, with the correct/wrong responses as the dependent 
variable, and the faces (ACTOR1, Face1 and Face2), the mode of animation (SB and 
FAP), and the presented emotions as the independent variables. 

Direct comparisons of the performances of the different faces in the different 
conditions were accomplished by computing the z-scores of the relevant log odds ratios 
from the parameters of the logit model, along with their Wald confidence intervals 
(level of confidence: p<.01; confidence intervals for acceptance of the null hypothesis: 
-2.58<z<2.58).³ 

Results from global comparisons (ignoring differences concerning presented 
emotions) show that ACTOR1 has better recognition rates than both faces in the FAP 
mode. The scores for the comparisons between ACTOR1 and the two faces in the SB 
condition do not reach significance, though the one for Face1 comes very close to doing 
so at the chosen level (z-score = 2.48). Finally, both synthetic faces increase their 
recognition rate when going from the FAP to the SB condition. Hence, at a global 
level the SB condition is closer to ACTOR1 than the FAP one. 
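
To make the decision rule concrete, here is a minimal sketch of a Wald test on a log odds ratio for a plain 2x2 table. This is a simplification and not the logit model used in the paper: it treats the judgments as independent, which a within-subjects design does not guarantee. The counts are also an assumption, obtained by converting the Table 1 percentages back to counts out of 30 and summing over the 7 emotions (210 judgments per condition).

```python
import math

Z_CRIT = 2.58  # two-sided critical value for p < .01, as in the text


def wald_z(correct_a, total_a, correct_b, total_b):
    """Wald z-score of the log odds ratio of a correct answer, A versus B."""
    wrong_a = total_a - correct_a
    wrong_b = total_b - correct_b
    log_odds_ratio = math.log((correct_a * wrong_b) / (wrong_a * correct_b))
    std_error = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)
    return log_odds_ratio / std_error


# Assumed counts: 115/210 correct for ACTOR1 and 140/210 for Face1-SB.
z = wald_z(115, 210, 140, 210)
significant = abs(z) > Z_CRIT
print(f"z = {z:.2f}, significant at p < .01: {significant}")
```

Under these assumptions the absolute z-score is about 2.49, just below the 2.58 threshold, which is in line with the near-significant global comparison between ACTOR1 and Face1 in the SB condition.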

The results of a more fine-grained analysis, addressing faces, mode of animation 
and presented emotion, are summarized in Tables 2 through 4. 



Table 2. Significant comparisons involving at least one synthetic face in the SB mode. 

Anger        ACTOR1 > Face1
             ACTOR1 > Face2
Happiness    ACTOR1 > Face1
             Face2 > Face1
Disgust      Face1 > ACTOR1
             Face1 > Face2
Surprise     Face1 > ACTOR1
             Face2 > ACTOR1
Sadness      Face1 > ACTOR1
             Face2 > ACTOR1

Table 3. Significant comparisons involving at least one synthetic face in the FAP mode. 

Anger        ACTOR1 > Face1
             ACTOR1 > Face2
Fear         ACTOR1 > Face2





Table 2 reports all the significant comparisons in which one (or more) of the synthetic 
faces was in the SB mode; all the omitted combinations did not yield significant 
differences. Table 3 reports the significant results for comparisons in which one (or 
more) of the faces was in the FAP mode, and Table 4 informs about comparisons on a 
given face in the two modes of animation (SB vs. FAP). 

Table 4. Significant comparisons for the same face in the two modes. 

Happiness    Face1-FAP > Face1-SB
Surprise     Face2-SB > Face2-FAP
             Face1-SB > Face1-FAP
Fear         Face2-SB > Face2-FAP
             Face1-SB > Face1-FAP
Sadness      Face2-SB > Face2-FAP
             Face1-SB > Face1-FAP



3 See [9] for details. 
Summarizing the results, we have that: 

• The SB mode improves the recognition rates of both faces by the same amount 
and on the same presented emotions (surprise, fear and sadness) over the FAP 
mode (Table 4). The latter benefits only Face1, on happiness. Hence, the SB 
condition is superior to the FAP one, as far as RR is concerned. 

• The FAP condition does not cause great differences across the synthetic faces, nor 
does the SB one, apart from minor differences (Face2 is superior to Face1 on 
happiness, while the opposite obtains on disgust; Table 2). 

• With respect to ACTOR1, the two faces in the FAP condition give either identical 
or poorer recognition rates (Table 3). This accords with the conclusions from the 
global analysis. 

• With SB, ACTOR1 is still better than Face1 and Face2 on anger, and better than 
Face1 on happiness. The situation reverses in favor of SB for both faces on 
surprise and sadness (Table 2). So, the global similarity between SB and ACTOR1 
we observed above concealed important differences that tend to cancel each other 
out at the global level. 

In conclusion, on absolute grounds the SB mode has higher recognition rates than the 
FAP one. With respect to ACTOR1, SB-faces do not globally differ from it, whereas 
ACTOR1 shows a global superiority over the FAP mode. When we go into details, 
however, the picture changes somewhat: the SB mode and ACTOR1 diverge on anger, 
where ACTOR1 outperforms SB-faces, and on surprise and sadness, where the 
opposite obtains. Now, anger is the only emotion on which ACTOR1 is stably superior to 
all faces in all conditions, suggesting that our faces as such are bad at it (or, the actor 
is very good at anger). The superiority of the SB mode on surprise and sadness, on the 
other hand, suggests that the scripts of the SB mode produce expressions that are more 
recognizable than those of our actor. 

Turning to the FAP mode, we should not hasten to conclude that it is ineffective. 
True, on the global tests the FAP faces were worse than ACTOR1. The detailed analysis, 
however, shows that this is basically due to their poor performance on anger and, for 
Face2, on fear. If we discount anger on the same grounds as for the SB mode (the 
actor is very good at anger), and accept that fear is a real problem for Face2-FAP, in the 
remaining conditions the FAP mode turns out to be closer in performance to ACTOR1 
(the model) than the SB one; this accords with our expectations (see Table 3). 



4.2 Distributions of Recognition Errors 

We turn now to study errors, trying to understand whether and how the way they 
distribute is affected by our independent variables: faces, mode of animation and 
presented emotions. We will not resort to the same techniques as in the previous section. 
Log-linear analysis can be easily extended to address the greater number of response 
categories (7 instead of 2) that is now required; however, the limited size of our 
sample (30 people) would weaken our conclusions. Moreover, in this section we are 
interested in finding simple but powerful tools to succinctly characterize errors and their 
distributions, allowing for easy comparisons; log-linear techniques do not directly 
provide them. Thus, we will explore an information-theoretic approach [10] that 
factors out various contributions to the global information/uncertainty displayed by 
confusion matrices, turning some of them into the tools we need. In this work we will 
focus on the number of confusion classes, and on the characterization of errors shared 
across conditions. Other important dimensions (e.g., typical error classes) will not be 
addressed here.⁴ 

Table 5 reports the global confusion matrix, showing how correct responses and er- 
rors distribute across stimuli (rows) and responses (columns). 



Table 5. Overall confusion matrix (percentages). 

stimulus \ response   disg.   happ.   neutral   fear   anger   surpr.   sadn.
disgust                22%      9%      16%      9%      8%      4%     32%
happiness               4%     78%       8%      1%      4%      3%      3%
neutral                 2%      3%      66%      2%     11%      7%      9%
fear                    4%      6%       8%     44%     14%     21%      2%
anger                   5%      1%      20%     13%     49%     10%      3%
surprise                1%      9%       7%     13%      8%     61%      2%
sadness                 7%      2%      18%     14%      4%     10%     44%



An appreciation of how errors distribute can be obtained by considering L, the mean 
entropy of the stimulus sequence that is not accounted for in the response sequence. 
For a given response category, r, L amounts to the log of the mean number of 
stimulus categories that cause responses to fall in r. Ideally, each response is induced 
by one and only one stimulus category (the right one), so that L=0. The converse of L 
is G, which informs about the (log) mean number of response categories for each 
stimulus category.⁵ 
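
The two quantities can be sketched directly from a confusion matrix, reading L as the confusion-matrix entropy minus the response entropy and G as the confusion-matrix entropy minus the stimulus entropy. The sketch below applies this to the percentages of Table 5. It assumes that all stimulus categories are equally frequent (the design is balanced) and renormalizes each row, since the rounded percentages do not always sum to 100; the function names are illustrative.

```python
import math

# Table 5, rows = stimuli, columns = responses, both in the order:
# disgust, happiness, neutral, fear, anger, surprise, sadness.
TABLE5 = [
    [22, 9, 16, 9, 8, 4, 32],
    [4, 78, 8, 1, 4, 3, 3],
    [2, 3, 66, 2, 11, 7, 9],
    [4, 6, 8, 44, 14, 21, 2],
    [5, 1, 20, 13, 49, 10, 3],
    [1, 9, 7, 13, 8, 61, 2],
    [7, 2, 18, 14, 4, 10, 44],
]


def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def l_and_g(matrix):
    """Return (L, G) for a stimulus-by-response matrix of row-wise frequencies."""
    n = len(matrix)
    # Joint distribution, assuming equiprobable stimuli and renormalized rows.
    joint = [[cell / sum(row) / n for cell in row] for row in matrix]
    h_cm = entropy(p for row in joint for p in row)
    h_stim = entropy(sum(row) for row in joint)
    h_resp = entropy(sum(col) for col in zip(*joint))
    return h_cm - h_resp, h_cm - h_stim


L, G = l_and_g(TABLE5)
print(f"2^L = {2 ** L:.2f}, 2^G = {2 ** G:.2f}")
```

A perfectly diagonal matrix yields L = G = 0, while a matrix in which every stimulus is spread evenly over the 7 responses yields 2^L = 2^G = 7.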

Table 6 reports the results in terms of 2^L (number of stimulus categories per response 
category) and 2^G (number of response categories per stimulus category). As can be seen, 
the FAP-faces are quite different from the other combinations, having the greatest 
figures (more error categories) on both dimensions. We must take these data with 
some care, though, for L and G do not discount the distribution of errors in the 
confusion matrix, and are sensitive to the error rate; hence, greater error rates might give 
rise to larger L and G, this way biasing comparisons. 



Table 6. Values of 2^L and 2^G 

              2^L     2^G
ACTOR1        2.61    2.40
Face2-SB      2.67    2.46
Face1-SB      2.65    2.48
Face1-FAP     4.28    3.50
Face2-FAP     5.18    3.74

Table 7. Values of dr and ds 

              dr      ds
ACTOR1        1.82    1.52
Face2-SB      2.48    1.97
Face1-SB      2.76    2.27
Face1-FAP     3.54    2.57
Face2-FAP     4.30    2.74



4 The price to pay for the information-theoretic approach is that it does not come equipped with 
the rich inferential apparatus of other techniques. Hence we will not be able to anchor our 
conclusions to tests of statistical significance. 

5 $L = H_{cm} - H_{resp}$ and $G = H_{cm} - H_{stim}$, where $H_{stim}$ = entropy of the 
stimulus sequence, $H_{resp}$ = entropy of the response sequence, and $H_{cm}$ = entropy of 
the confusion matrix. 
A more refined measure of the way responses distribute is provided by the indices ds 
and dr. The former measures the effective mean number of error (confusion) classes 
per stimulus, discounting the error distribution in the sense that stimuli with a low 
number of errors, which are spread across many response categories, contribute little. 
The other index, dr, informs about the mean number of stimulus categories a response 
category collects confusion from. In an ideal situation, both indices should be 0.⁶ 

The results are reported in Table 7. Although the ordering is compatible with that 
of Table 6, discounting the error rate reveals interesting differences. The SB faces are 
now somewhat farther from ACTOR1. Moreover, the variation of ds across synthetic 
faces is quite limited (range: 1.97-2.74) compared to the variation for dr (2.48-4.30). 
We conclude that: a) the SB faces are the closest to ACTOR1, as far as the number of 
error categories is concerned; b) the number of confusion categories along the 
stimulus dimension (ds) is substantially stable across synthetic faces and mode of animation; 
c) the number of confusion categories along the response dimension (dr) shows a clear 
ascending trend, when we move from ACTOR1 to SB faces and then to FAP ones. 
Hence, the real differences between the SB mode and the FAP mode on error classes 
concern the way response categories collect confusions (dr), rather than the number of 
error classes per stimulus category. The higher figures suggest that typical patterns of 
errors are less likely to emerge in the FAP mode. 
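
The relation between Tables 6 and 7 can be checked numerically. The sketch below assumes that the discounting takes the form d = 2^((X - He)/e), with X = L for dr and X = G for ds, He the binary entropy of the error rate e, and e taken as one minus the mean recognition rate of Table 1. This reading of the definitions is an assumption, adopted because it reproduces the published figures closely; the names `CONDITIONS`, `binary_entropy` and `discounted_classes` are illustrative.

```python
import math

# Per condition: Table 1 recognition rates (%), Table 6 values (2^L, 2^G),
# and the Table 7 values (dr, ds) that the sketch tries to recover.
CONDITIONS = {
    "ACTOR1":    ([90, 97, 70, 13, 47, 50, 17], 2.61, 2.40, 1.82, 1.52),
    "Face2-SB":  ([23, 77, 67, 17, 90, 77, 97], 2.67, 2.46, 2.48, 1.97),
    "Face1-SB":  ([53, 40, 60, 53, 87, 77, 97], 2.65, 2.48, 2.76, 2.27),
    "Face1-FAP": ([27, 80, 70, 20, 40, 17, 7], 4.28, 3.50, 3.54, 2.57),
    "Face2-FAP": ([7, 80, 53, 17, 33, 0, 7], 5.18, 3.74, 4.30, 2.74),
}


def binary_entropy(e):
    """Entropy (bits) of a correct/wrong outcome with error rate e, 0 < e < 1."""
    return -e * math.log2(e) - (1 - e) * math.log2(1 - e)


def discounted_classes(two_pow_x, error_rate):
    """Assumed form of the index: 2 ** ((X - He) / e), given 2^X and e."""
    x = math.log2(two_pow_x)
    return 2 ** ((x - binary_entropy(error_rate)) / error_rate)


def recover_table7():
    """Return {condition: (dr, ds)} computed from Tables 1 and 6."""
    rows = {}
    for name, (rates, two_l, two_g, _, _) in CONDITIONS.items():
        e = 1 - sum(rates) / (100 * len(rates))
        rows[name] = (discounted_classes(two_l, e), discounted_classes(two_g, e))
    return rows


for name, (dr, ds) in recover_table7().items():
    print(f"{name}: dr = {dr:.2f}, ds = {ds:.2f}")
```

Under these assumptions the recomputed values stay within a few hundredths of the entries of Table 7, the residual differences being attributable to the rounding of the published figures.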

Suppose, now, that we want to know how similar the error distribution along the 
stimulus dimension is between two face*mode-of-animation combinations, say Face2-FAP 
and Face2-SB, as a way to capture the contribution of Face2 to errors. The idea is that, 
to a certain extent, the errors that are shared between Face2-FAP and Face2-SB reflect 
Face2’s properties (the way it looks, the underlying animation engine, rendering, etc.), 
providing us with a measure of the confusions Face2 induces, independently of the 
condition (FAP vs. SB) it is presented in. To this end, we resort to the indices δs and δr,⁷ 
which are computed on the pooled confusion matrix for Face2-FAP and Face2-SB. 
They yield the effective fraction of errors that fall outside the shared error categories, 
corrected for the overall differences in the distribution of stimuli (δs) and responses 
(δr). In a way, the lower these figures, the higher is the probability that a given error is 
due to Face2 itself, rather than to the mode of animation (or to any other intervening 
condition). Table 8 reports results obtained by pooling together: ACTOR1 and 
ACTOR2 (reported as ACTOR), Face1-FAP and Face2-FAP (FAP), Face1-SB and 
Face2-SB (SB), Face1-SB and Face1-FAP (Face1), Face2-SB and Face2-FAP 
(Face2).⁸ 



6 $d_s = 2^{(G - H_e)/e}$ and $d_r = 2^{(L - H_e)/e}$, where L and G are as before, $H_e$ is 
the error entropy of the confusion matrix, and e is the error rate. 

7 For reasons of space, we cannot report here the formulae for δs and δr. See [10] for more on 
this point. 

8 See fn. 2. 
Table 8. Values of δr and δs for the various conditions. 

           δr      δs
ACTOR      0.12    0.14
FAP        0.2     0.24
SB         0.24    0.34
Face2      0.35    0.57
Face1      0.39    0.67

Table 9. Values of δr and δs computed with respect to ACTOR1. 

              δr      δs
Face1-FAP     0.22    0.43
Face2-FAP     0.30    0.56
Face1-SB      0.66    0.78
Face2-SB      0.63    0.79



Neglecting ACTOR, under the proposed interpretation the mode of animation 
accounts for a greater fraction of the errors than the faces do. 

In Table 9 we have pooled the confusion matrices of each face*mode-of-animation 
combination with that of ACTOR1, and then computed δr and δs. The figures indicate 
the amounts of non-shared errors between each combination and ACTOR1, and 
inform us about how similar each combination is to ACTOR1: the lower the fraction of 
errors they do not share, the more similar they are. Face1-FAP is the combination with 
the lowest figures, hence the one sharing the greatest amount of errors with ACTOR1, 
closely followed by Face2-FAP. The two SB-faces are farther away, sharing fewer 
errors with ACTOR1. 

In conclusion, the analysis of errors has shown that: 

• SB-faces disperse errors on fewer confusion categories than FAP-ones, in this 
being closer to ACTOR1 (Table 7); 

• the FAP-faces share a greater amount of errors with ACTOR1 than the SB-faces 
(Table 9); 

• in a given face*mode-of-animation combination it is the mode of animation that 
accounts for the greater portion of errors (Table 8). 

We can interpret these results by saying that the mode of animation affects the error 
distribution more than the type of face (Face1 or Face2). In detail, the confusion 
categories of the SB-faces do not overlap much with those of the actor, which results in 
a low number of shared errors. That is, the error distribution of the SB mode is 
quite distinct from that of the actor on both the stimulus and the response dimension. 
The FAP-faces, on the other hand, because of their greater number of error categories 
(Table 7), share some of them with the actor, this way explaining the higher number of 
common errors (Table 9). In other words, the great number of common errors between 
the FAP faces and the actor is probably a consequence of the higher dispersion of 
error categories in the FAP conditions. 



4.3 Questionnaires 

At the end of the session each participant was asked to answer 4 closed-ended questions 
asking for which face they felt the judgement task was easiest/hardest and which 
among the synthetic faces was the most natural/pleasant. The actor was rated as the 
easiest face to judge (53%), and Face1 got slightly better results than Face2. 
Concerning pleasantness/naturalness, Face2 was rated higher than Face1 (59% versus 
41%). 
5 Conclusions and Future Work 

In this paper we have proposed an approach to the assessment of the identifiability of 
emotional expressions performed by synthetic faces through two different modes of 
animation: the so-called script-based and the data-driven one. Both absolute and rela- 
tive assessments were pursued, the latter by comparing the expressions of two syn- 
thetic faces to those performed by an actor. With respect to previous studies, we have 
adopted more refined techniques: a loglinear analysis of the data for the recognition 
rate, and an information-theoretic approach for error analysis. The results indicate 
that, in absolute terms, the SB condition is superior to the data-driven one with both 
faces, as far as recognition rate is concerned. In relation to the human model, however, 
the data-driven method matches the model somewhat better. With respect to error 
distribution, both the SB and the FAP mode seem to differ from the human model, 
though for different reasons. All these results are largely independent of the face used. 

Besides allowing comparisons among different conditions, the proposed approach 
may directly impact design and development choices. For instance, the fact that no 
major differences are exclusively due to the faces per se might suggest that the 
state of the art of the relevant technologies is such that the appearance and other physical 
characteristics of the synthetic faces are presently less crucial than the way information 
about the relevant emotional expression is provided to them. Another possible 
indication is that if recognizability is the ultimate goal, then script-based approaches seem to 
be appropriate: hand-written scripts allow expressions to be finely tuned until the desired 
results are obtained. On the other hand, if the focus is on ‘naturalness’, then 
data-driven methods are a ‘sort of good’ choice, because they produce recognition patterns 
close to those of the human model. However, they are still far from appropriately 
matching the model on error distribution, suggesting that design and development 
effort be focused on this aspect, in particular on reducing the number of error 
categories. 

Turning to possible improvements, this study has not attempted to identify error 
categories; rather, we simply measured their numbers and common error fractions. 
However, information about the most common error categories, along both the 
stimulus and the response dimensions, would be extremely valuable to characterise what the 
face looks like in general. Besides this, there are factors that might affect the 
recognition task, which have not been addressed here. For instance: a) the sex of the synthetic 
face and/or of the source for the FAPs: do the two interact? Do they interact with the 
sex of the subjects? b) Attractiveness: synthetic faces are built to be attractive, 
whereas (true) human faces aren’t. Does this have any effect on our task? Other important 
directions for future investigations involve the relative importance (if any) of the 
upper/lower part of the face in the expression of emotions: How much does recognition 
deteriorate (if it does) when emotional expressions are limited to the upper/lower part 
of the face? Finally, the methodology could be improved by extending measurement to 
reaction times, this way obtaining information on the difficulty of the judgments for 
the subjects; and by trying to relate subjective evaluations of the faces (as in §4.3) to 
the results of the data analysis. 
Abstract. Modeling user affect is becoming increasingly important for intelligent 
interfaces and agents that aim to establish believable interactions with their 
users. However, evaluating the accuracy and effectiveness of affective user 
models is extremely challenging because of the many unknowns in affect 
comprehension. In this paper, we give an overview of existing approaches to the 
validation of affective user models, and we describe our own experience with an 
approach for direct model evaluation that we have used in a recent study. 



1 Introduction 

Recent years have seen a flourishing of research aimed at adding an affective 
component to human-computer dialogue. A key element of this endeavor is the capability to 
recognize user emotional states during the interaction, i.e., to build a model of the 
user’s affect. Building such models can be extremely challenging because it requires 
formalizing and applying strategies for emotion recognition that even human beings 
sometimes cannot generate or apply successfully. But validating the models can also 
be extremely hard because, in addition to the challenges common to any user model 
evaluation, validating affective user models suffers from the difficulty of obtaining 
reliable measures of user affect against which to compare the model predictions. 
Furthermore, because the research field is rather new, there are very few complete 
applications that include an affective user model and that can be used to test the model 
indirectly through evaluation of the application itself. 

In this paper, we address the problem of how to validate affective user models for 
improving human-computer dialogue. We start with an overview of the available tech- 
niques. We then describe our experience in using one of these techniques to evaluate 
an affective user model for the interaction with an educational game. 

2 Overview of Techniques for Validating Affective User Models 

Most of the empirical work in affective user modeling has so far been directed to 
assessing possible sources of affective data, before building a complete user model 
that can use these sources. In particular, researchers have been trying to identify reli- 
able ways to recognize symptoms of emotional reactions, ranging from observable 
changes in facial expressions, posture and intonation, to variations in lower level 
measures of emotional arousal, such as skin conductance and heartbeat. 

The standard technique applied in this line of work is emotion induction. Stimuli 
are devised to induce specific emotions in a set of subjects. Then, sensors are used to 
detect changes in behavioral expressions that are known to be influenced by these 
emotions. Finally, the sensors’ reliability in diagnosing emotional states from these 
measures of behavioral changes is evaluated. Techniques to obtain the desired emo- 
tions from the subjects include: (1) using a professional actor as a subject, who 
expresses the relevant emotions on demand [14]; (2) using standard stimuli for emo- 
tion elicitation that are not necessarily related to the task the affective user model will 
eventually be designed for, but that are known to be very reliable in generating the 
desired emotional states (e.g. movie clips [12]); (3) designing Wizard of Oz studies to 
elicit specific emotions in the context of the interaction that the affective user model 
will eventually support. So far, this approach has been mainly used to test frustration 
detection (e.g., [2], [11]). 

The main advantage of these emotion eliciting techniques is that they provide a re- 
liable base line against which to test sources of emotion data. The main disadvantage 
is that it may be difficult to generalize the reliability of data sources tested with these 
techniques to real interactions, because the user’s affective reactions may not be as 
intense, well defined and isolated as they are during the elicitation studies. 

In contrast with the substantial research on validating sources of affective data, there 
has been little work on evaluating complete affective user models. As with other user 
models, affective models can be evaluated either directly, by specifically measuring 
the accuracy of the model’s predictions, or indirectly, by testing the performance of an 
application that uses the model to adapt its behavior to the user’s affect. To our 
knowledge, so far there have been only two informal indirect evaluations of affective user 
models, because there are very few complete applications that include an affective 
user model. The first evaluation used a sample set of simulated users and scenarios as 
a preliminary validation of a proof-of-concept prototype that adapts to anxiety in 
combat pilots [10]. The second evaluation included two field studies of the Avatalk 
architecture, designed to detect affect in speech. In both studies, Avatalk was used as 
a training aid for users who had to learn how to convey specific affective states 
through speech as part of their job. The studies focused on system acceptance, and 
provided no results on Avatalk’s effectiveness as a training tool [8]. Furthermore, the 
authors recognize that these types of macro studies do not allow assessing the 
contribution of the affective user model to system performance, because of the confounding 
variables introduced by the other components that define the system’s interactive 
dialogue (e.g., usage of synthetic characters to deliver instruction, character decision 
making). 

The direct approach to model validation overcomes two of the main shortcomings 
of the indirect approach. First, this approach does not require having a complete 
system, as the interaction can be carried out via a Wizard of Oz setup. Second, a direct 
evaluation can provide a deeper understanding of the model behavior that is not 
confounded by other aspects of the application. However, the main challenge of direct 
evaluation is that it requires having a reliable measure of the user’s affective states 
during the interaction for comparison with the model’s assessment. Depending on the 
type of interaction and emotions that the model deals with, this measure can be quite 
hard to obtain. In the rest of the paper, we describe our experience in using this 
approach to evaluate a model of user affect during the interaction with an educational 
game, Prime Climb. To our knowledge, this is the first direct evaluation of an 
affective user model with real users. Gratch and Marsella [7] discuss a direct 
evaluation where an appraisal model is run over an evolving situation taken from a 
psychological instrument and then compared to subjects’ responses to the same instrument. 
However, because data was aggregated across subjects, this study does not assess the 
model’s ability to model individual differences. 

3 Validating a Model of User Affect in Educational Games 

The Prime Climb educational game has been designed by the EGEMS group at the 
University of British Columbia to help 6th and 7th grade students practice number 
factorization. It is a two-player game in which the players need to cooperate to climb 
to the top of a series of mountains, divided into numbered sectors (see Figure 1). Each 
player can only move to a sector that does not share any factors with the sector 
occupied by her partner, otherwise she falls. As with several other educational games, 
Prime Climb has been shown to be very engaging, but to have limited pedagogical 
effectiveness because many students do not have the learning skills necessary to benefit 
from this highly unstructured and easily distracting pedagogical interaction. 

The long-term research goal is to have a model of both student affect and knowledge 
that an intelligent pedagogical agent can use to balance student learning and 
engagement during the interaction with the game. Because of the complexity of the 
modeling task, we are building the affective and learning models separately, to pin 
down the factors that independently contribute to each assessment before proceeding 
to model the relevant synergies. We started to evaluate the model of student learning 
with the indirect approach, after building a pedagogical agent that uses the model to 
provide hints aimed at improving student learning [5]. However, we felt that we could 
not use the same approach to evaluate the affective user model. Because we still do 
not have a good understanding of how to build an agent that uses the affective user 
model to maintain student engagement in Prime Climb, there are too many aspects 
extraneous to the model that would cloud the interpretation of the indirect evaluation 
results in terms of model accuracy. Thus, we decided to try and evaluate the model 
directly, before building the agent that could use it. Before describing the evaluation 
methodology, we briefly illustrate our affective user model. More details can be found 
in [3] and [15]. 




Fig. 1. The Prime Climb Interface 
3.1 The Affective User Model 

In contrast with other affective user models that assess one specific emotion (e.g., 
[10]), or measures of valence and arousal (e.g., [1],[2]), our model assesses multiple 
specific emotions that we observed to influence the interaction with Prime Climb 
during pilot studies on the game. These are six of the 22 affective states described in 
the OCC cognitive theory of emotions [13]: joy/distress for the current state of the 
game (i.e., the outcome of a student or agent action), pride/shame of the student 
toward herself, and admiration/reproach toward the agent. 

The model relies on a Dynamic Bayesian Network (DBN) to probabilistically assess 
user emotional states from possible causes of emotional arousal, as described in 
the OCC theory of emotions. In this theory, emotions arise as a result of one’s 
appraisal of the current situation in relation to one’s goals. Thus, our DBN (see two 
sample slices in Figure 2A) includes variables for the user’s possible Goals when playing 
with Prime Climb, and for situations consisting of the outcome of any user or agent 
action (nodes User Action Outcome and Agent Action Outcome in Figure 2A). The 
desirability of an action outcome in relation to the user’s goals is represented by the 
node class Goals Satisfied. This in turn influences the user’s emotional states, 
represented in the DBN by three binary nodes (Emotions for Outcome, Emotions for Self, 
Emotions for Agent) that model the six emotions mentioned above. 

Fig. 2. Two time slices of the affective model (A); sub network to assess goals (B). 

User goals are a key element of the OCC model, but assessing these goals is 
non-trivial, especially when asking the user directly is not an option (as is the case in 
educational games). Thus, our DBN also includes nodes to infer user goals from indirect 
evidence coming from both the user’s Personality [6] and Interaction Patterns. Because 
all the variables in this sub-network are observable, we identified the variables and 
built the corresponding conditional probability tables (CPTs) using data collected 
through a Wizard of Oz study during which an experimenter guided the pedagogical 
agent. In these studies, students took a personality test based on the Five Factor 
personality theory [6]. After game playing, students filled in a questionnaire on what goals 
they had during the interaction. The probabilistic dependencies among goals, 
personality, interaction patterns and student actions were established through correlation 
analysis between the test results, the questionnaire results and student actions in the 
log files recorded during the interactions [15]. Figure 2B shows the resulting 
sub-network. 

In the sub-network that represents the appraisal mechanism, the links and CPTs 
between Goal nodes, the outcome of student or agent actions and the Goal Satisfied 
nodes are currently based on our subjective judgment. Some of these links are quite 
obvious; for example, if the student has the goal Avoid Falling, a move resulting in a fall will 
lower the probability that the goal is achieved. Other links (e.g., those modeling which 
actions cause a student to have fun or learn math) are less obvious, and could be built 
only through student interviews that we could not include in our studies. When we 
did not have good heuristics to create these links, we did not include them in the 
model. 

The links between Goal Satisfied and emotion nodes are defined as follows. 
Because the outcome of every agent or student action is subject to student appraisal, 
every Goal Satisfied node influences Emotions for Outcome in any given slice (see 
Figure 2A, both slices). Whether a Goal Satisfied node influences Emotions for Self or 
Emotions for Agent in a given slice depends upon whether it was the student (slice t_i in 
Figure 2A) or the agent (slice t_i+1 in Figure 2A) who caused the current game state. The 
CPTs of emotion nodes given Goal Satisfied nodes are defined so that the probability 
of the positive emotion is proportional to the number of Goal Satisfied nodes that are 
true. 
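
As an illustration of the last rule, the following Python sketch builds such a CPT for one binary emotion node. It assumes that the proportionality constant is 1/n for n Goal Satisfied parents, so that the probability runs from 0 to 1; the text does not state the constant, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def positive_emotion_cpt(n_goal_nodes: int) -> Dict[Tuple[bool, ...], float]:
    """Map every assignment of the Goal Satisfied parents to P(positive emotion).

    Assumption (not from the paper): the probability equals the share of
    parents that are true, i.e. the proportionality constant is 1/n.
    """
    if n_goal_nodes < 1:
        raise ValueError("an emotion node needs at least one Goal Satisfied parent")
    cpt = {}
    for parents in product([False, True], repeat=n_goal_nodes):
        # sum() counts the True entries, i.e. the satisfied goals
        cpt[parents] = sum(parents) / n_goal_nodes
    return cpt


cpt = positive_emotion_cpt(3)
for parents, p_positive in sorted(cpt.items()):
    print(parents, round(p_positive, 2), round(1 - p_positive, 2))
```

Each row lists the parent assignment, the probability of the positive emotion (for example, joy) and the probability of its negative counterpart (distress).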



3.2 Model Evaluation 

A direct evaluation of our affective user model requires ascertaining the actual emo- 
tions that students experienced during the interaction with Prime Climb. However, 
because these emotions tend to be ephemeral and can change multiple times during the 
interaction, it is unrealistic to expect that, after game playing, subjects remember the 
affective states they went through. A technique that is often used to help subjects rec- 
ollect volatile states is to show them a video of their interaction. However, we could 
not use this approach both because we could not keep our subjects for the required 
additional time and, more importantly, because it is highly unlikely that our 10- and 
11-year-old students would be willing or able to undergo this procedure. Having an 
experimenter code the subjects’ emotional states from a video of the interaction was 
also unlikely to yield reliable results. When we tried to use this technique in another 
Prime Climb study to test biometric sensors for emotion detection [4], our video 
recordings showed that users’ visible bodily expressions often did not give enough 
indication of their specific emotions, although they were more reliable for detecting 
valence and arousal. 

Given the above factors, we decided to devise a strategy to obtain the information 
on their emotions directly from our subjects during the interaction. However, this 
approach is chancy because, if not done properly, it can significantly interfere with the 
very emotional states we want to assess. Furthermore, subjects’ self-reports can be 
unreliable both because of a well-known tendency that subjects have to give artificially 
positive answers out of politeness, and because some subjects may not be able 
to assess their emotional states. Both these phenomena will have to be taken 
into account when using self-reports in empirical studies. Nonetheless, for our 
specific type of application, they seem to be the least noisy source of information. 

Fig. 3. Emotion-reporting dialogue box 

In the rest of the paper, we first describe our approach to designing an interface 
to elicit emotion self-reports during Prime Climb playing as unobtrusively as possible, 
given the constraints imposed by this type of interaction. We then discuss some general 
methodological suggestions to deal with the potential unreliability of emotion 
self-reports in testing affective user models. 




4 Pilot Study 



To find an unobtrusive way to obtain emotion self-reports during the interaction with 
Prime Climb, we ran a pilot study with 6th grade students in a Vancouver school. 
The study tested two alternative interfaces. One interface had an emotion-report dialogue 
box permanently present on the side of the game window (see Figures 3 and 4). 
This interface is quite unobtrusive but may not generate much data because it relies 
on the student’s willingness to volunteer self-reports. The second interface included 
both the permanent dialogue box and a copy of the same dialogue box that would pop up 
whenever either of the following conditions was satisfied: (1) the student had not 
submitted an emotion self-report for a period of time longer than a set threshold or (2) 
the underlying affective model detected a relevant change (also based on a set 
threshold) in its belief about the student’s emotional state. Students were required to 
submit a self-report when the pop-up dialogue box appeared. 
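
The two pop-up conditions can be sketched in Python as follows. This is only an illustration under stated assumptions: the threshold values, the class and parameter names, and the choice to measure the belief change against the model's belief at the time of the last self-report are all hypothetical, since the text only says that both conditions rely on set thresholds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PopupPolicy:
    # Hypothetical values; the paper does not report the thresholds it used.
    max_silence_s: float = 60.0      # longest tolerated gap between self-reports
    min_belief_change: float = 0.2   # change in P(emotion) treated as relevant

    def should_popup(self, now_s: float, last_report_s: float,
                     belief_now: Dict[str, float],
                     belief_at_last_report: Dict[str, float]) -> bool:
        # Condition (1): no self-report for longer than the time threshold.
        silent_too_long = (now_s - last_report_s) > self.max_silence_s
        # Condition (2): the model's belief moved by at least the threshold
        # for any emotion node (assumed reference: belief at the last report).
        largest_shift = max(
            (abs(p - belief_at_last_report[node]) for node, p in belief_now.items()),
            default=0.0,
        )
        return silent_too_long or largest_shift >= self.min_belief_change


policy = PopupPolicy()
print(policy.should_popup(30.0, 20.0, {"joy": 0.9}, {"joy": 0.5}))
```

Lowering the pop-up frequency then amounts to raising one or both thresholds.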




Fig. 4. Interface with both the permanent and pop-up dialogue box 

The questions that we were trying to explore with this pilot study were the following: 

1. Do students volunteer self-reports in the permanent dialogue box frequently 
enough to provide sufficient data for model evaluation? 

2. If not, and we must resort to the pop-up dialogue box, how do students tolerate 
it? 

3. If we need the pop-up box, is it worth keeping the permanent box for those stu- 
dents who still want to volunteer affective information? 

As Figure 3 shows, the emotion dialogue box only elicited information on two of 
the three sets of emotions targeted by our model, both because it was felt that dealing 
with three different emotional states would be too confusing for our subjects, and 
because teachers suggested that students would have more problems in reporting 
emotions toward themselves than toward the game or the agent. Not having explicit 
information on pride/shame was not a serious limitation because we could still derive 
information on these emotions from the information obtained on the other two pairs. 

We are aware that directly asking the students about their feelings may feel unnatural 
and perhaps too pushy. A more discreet approach that is often used in emotional 
psychology is to ask about factors that are antecedents of emotions, such as goals and 
expectations. However, getting a reliable assessment of students’ emotions through this 
indirect approach would require too many questions to be acceptable in a real-time 
interaction as fast paced as that with Prime Climb. 

The study setup was as follows: students were told that they would be playing a 
game with a computer-based agent that was trying to understand their needs and give 
help accordingly. Therefore, the students were encouraged to report their feelings 
whenever there was a change in their emotions so that the agent could adapt its 
behavior. In reality, the agent was directed by an experimenter through a Wizard of Oz 
interface. A key difference between our study and previous studies that used the Wizard 
of Oz paradigm (e.g., [2], [11]) is that our experimenter did not try to elicit specific, 
intense emotions through extreme behavior, because we wanted the results of the study 
to be generalizable to normal interactions in which these extreme behaviors are 
hopefully the exception and not the rule. Thus, the experimenter was instructed to provide 
help any time the student showed difficulty with the climbing task and factorization 
knowledge, to resemble the behavior of the pedagogical agent for student learning that at the 
time of the study was still under development. Help could be provided at different 
levels of detail (from a general suggestion to think about the factorization of the 
numbers involved in an incorrect move, to telling the student where to move). In the 
condition with the pop-up dialogue box, all the experimenter’s hints and student’s actions 
were captured by the affective user model, and the game would pop up the additional 
dialogue box following the criterion described earlier. Because of logistical constraints, the 
experimenter had to act as a companion in the climbing task, but he also directed the 
pedagogical agent in a Wizard of Oz fashion. This could be done because the student 
could not see the experimenter’s screen (see Figure 5 for a similar study setup). 
Altogether, 10 students participated in the study: 4 students used the version of the game 
with the permanent dialogue box only (PDB), and 6 used the version including the pop-up 
box (PDB-POPUP)¹. After 10-15 minutes of game playing, the students completed a 
questionnaire targeting the questions on interface acceptance described earlier (see 
Table 2). We also recorded how often students volunteered information in the emotion 
dialogue boxes (see Table 1). 

The “# per student” column in Table 1 reports for each interface how many self- 
reports each student volunteered in the permanent dialogue box. The numbers show 
that some students tend to volunteer few self-reports, suggesting that the permanent 
dialogue box by itself may not consistently generate sufficient data for model testing. 
On the other hand, the numbers for volunteered self-reports in the PDB-POPUP row 
also show that some students continued to volunteer information in the permanent 
dialogue box even though they knew that the system would ask for the information 
explicitly when it needed it. 



Table 1. Number of self-reports generated in the two interfaces 

             Volunteered self-reports     Reports in pop-up box 
Group        # per student     Mean       # per student        Mean 
PDB only     1,2,4,10          4.25       NA                   NA 
PDB-POPUP    0,3,3,4,7,12      4.48       16,20,26,18,16,36    21.6 



Thus, this pilot study suggests that we do need the pop-up box to proactively elicit 
self-reports, although it is worth keeping the permanent box around for those students 
who tend to volunteer information. We then checked the post questionnaire for 
students in the PDB-POPUP group to see how they tolerated the pop-up dialogue box. 
Table 2 reports the results. Students’ answers are on a Likert scale from 1 (strongly 
disagree) to 5 (strongly agree), where 3 represents indifference. 

Table 2. Average scores for post questionnaire items in the pilot study; scores are on a Likert 
scale from 1 (strongly disagree) to 5 (strongly agree). 

Questionnaire item                                            Average score 
The popup dialog box showed up too frequently.                4.2 
The popup dialog box interfered with my game playing.         3.2 
The permanent dialog box interfered with my game playing.     2.56 
The questions in the dialog box were clear to me.             4.11 
It bothered me having to tell the system how I feel.          2.56 



The averages in Table 2 suggest that, although students on average found that the 
pop-up box showed up too frequently, it did not seem to interfere too much with 
game playing. This gave us confidence that, by adjusting the pop-up frequency, we 
could have an interface students can live with and that provides enough data for 
model evaluation. 



¹ For technical reasons we could not divide students equally between the two groups. 

Another encouraging result from the questionnaire is that students didn’t seem to 
mind having to express their emotions to the system. This is important to reduce the 
possibility that the mere act of having to express their emotions is upsetting for stu- 
dents, regardless of the available self-reporting mechanism. The questionnaire also 
asked for suggestions on how to improve the dialogue boxes or the emotion input 
mechanism in general, but we did not get any relevant answers. 

Given the results of the pilot study, we decided to run the empirical evaluation of 
our model by using the interface with both dialogue boxes. However, we adjusted the 
algorithm that manages the pop-up box so that the box would appear less frequently. 

5 Using the Two-Box Interface for Model Evaluation 

Twenty 7th grade students from a local Vancouver school participated in the study to 
evaluate the Prime Climb affective model. The study setup was exactly the same as 
for the pilot study described earlier, except that all the students used the two-box 
interface, and we had two sessions running in parallel, with two students playing with two 
experimenters who also directed the agent via the Wizard of Oz interface (see Figure 5). 
All students’ inputs in the dialogue boxes were collected for comparison with the 
corresponding model assessments. After game playing, students filled out the same 
questionnaire on interface acceptance that was used in the previous study. In this 
section, we discuss whether the two-dialogue-box setup matched our expectations 
as a technique for model testing. In particular, we focus on how to address a set of 
questions that are fundamental to ascertaining the reliability of any technique that relies 
on self-reports for direct evaluation of an affective user model: 

(1) What was the user acceptance of the interface to elicit emotion self-reports? 

(2) If there are subjects annoyed by this interface, do we have to discard their data? 

(3) How reliable are the self-reports elicited through this interface? 

What is the acceptance of the interface for emotion self-report? 

Table 3 reports the average scores on each post-questionnaire item. Although some 
students are still bothered by the pop-up dialogue box, the average level of declared 
interference is fairly low. As in the first study, acceptance of the permanent dialogue 
box is very high, students did not seem to mind telling the system how they felt, and 
were pretty clear about what the dialogue boxes were asking. 



Table 3. Average scores for questionnaire items in the second study 

     Question type                                                 Mean    StDev 
C    The popup dialog box showed up too frequently.                3.4     1.5 
D    The popup dialog box interfered with my game playing.         2.8     1.4 
E    The permanent dialog box interfered with my game playing.     1.9     1.4 
F    The questions in the dialog box were clear to me.             4.5     0.9 
G    It bothered me having to tell the system how I feel.          2.1     1.1 

What to do with subjects who were annoyed by the interface? 

The questionnaire results showed that 10 students gave a rating higher than 3 to the 
question asking whether the pop-up box showed up too frequently (question C in 
Table 3), or to the question asking whether the pop-up dialogue box interfered with game 
playing (question D). Because the ratings indicate that these students were somewhat 
annoyed by the pop-up box, what should we do with their data? If they were truly 
upset by the pop-up box, perhaps we should discard their self-reports when computing 
the model’s accuracy. The model would never be able to detect their negative affect 
since it does not take into account the appearance of the dialogue box in its assessment 
(recall from Section 3.1 that the model was not built using data from emotion 
self-reports). More specifically, the model would tend to underestimate the players’ 
negative emotions toward the game (Distress) and perhaps toward the agent (Reproach). 

To test whether this was the case, we computed the model accuracy in detecting 
distress and reproach for the subset of students who gave a score higher than 3 to 
question C or D in the post-questionnaire (see Table 4 and Table 5). We then 
compared this accuracy with the accuracy for the students who did not report 
annoyance with the dialogue boxes. Model accuracy is computed as the fraction of the 
students’ reported emotions that the model predicted correctly. The accuracy for students 
who were not annoyed with the dialogue box is 100% for Distress and 75% for 
Reproach. Table 4 and Table 5 show lower accuracies on Distress and Reproach for 
those students who reported annoyance with the dialogue box. Unfortunately, we don’t 
have sufficient data to draw reliable conclusions based on these numbers. The 
difference between the two accuracies is not statistically significant. Thus, we cannot reject 
the null hypothesis that annoyance with the dialogue box did not affect the player’s 
attitude toward the game, and we have no basis to eliminate the self-reports of the 
annoyed subjects from the analysis of model accuracy. On the other hand, we are 
aware that lack of statistical significance could also be due to the limited number of 
subjects and their uneven distribution between the two groups. 
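
The accuracy measure defined above is simple arithmetic, sketched here in Python with the Distress counts reported in Table 4. The function and variable names are hypothetical; returning None for a group with no datapoints is an assumption of this sketch, chosen so that an empty group is not reported as 0%.

```python
from fractions import Fraction
from typing import Optional


def accuracy(correct: int, datapoints: int) -> Optional[Fraction]:
    """Fraction of reported emotions that the model predicted correctly."""
    if datapoints == 0:
        return None  # accuracy is undefined when no emotion was reported
    return Fraction(correct, datapoints)


# (correct predictions, # of Distress datapoints), taken from Table 4
distress_counts = {"4 or 5 to question D": (3, 4), "4 or 5 to question C": (2, 3)}

for group, (correct, n) in distress_counts.items():
    print(f"{group}: {float(accuracy(correct, n)):.1%}")
```

With so few datapoints per group, a single additional correct or incorrect prediction changes the accuracy by 25 or more percentage points, which is why the comparison cannot support a reliable conclusion.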



Table 4. Analysis of Distress accuracy for students who reported annoyance with the dialogue 
box 

Questionnaire           Students                      Students who         # of Distress    Correct 
answers                                               reported Distress    datapoints       predictions 
4 or 5 to question D    X5, X6, X10, E2, E5, E6       X5, E2               4                3 (75%) 
4 or 5 to question C    X1, X4, X5, X6, E3, E4, E5    X5                   3                2 (66.7%) 


Table 5. Analysis of Reproach accuracy for students who reported annoyance with the dialogue 
box 

Questionnaire           Students                      Students who         # of Reproach    Correct 
answers                                               reported Reproach    datapoints       predictions 
4 or 5 to question D    X5, X6, X10, E2, E5, E6       0                    0                - 
4 or 5 to question C    X1, X4, X5, X6, E3, E4, E5    X1, E4               2                2 (100%) 

What we can say, however, is that negative emotion reports were only a small 
fraction (4% for Distress and 2% for Reproach) of all the emotion reports generated 
by the 10 students who had declared annoyance with the dialogue box (9 reports on 
average for each emotion pair). Thus, if we can trust the students’ self-reports, these 
results could be interpreted as an indication that annoyance with the dialogue box, in 
fact, does not always translate into annoyance with the game or the agent. This would 
be in itself a quite encouraging finding for researchers interested in evaluating 
affective user models, because it shows that subjects can tolerate to some extent the 
interference caused by the artifacts designed to elicit their emotions. We obviously 
need more data before we can draw any reliable conclusion on this issue. And because 
any such conclusion would have to rely on an analysis of the student emotion self- 
reports similar to the one discussed above, we also need to understand what the reli- 
ability of these self-reports is. This takes us to the third and final question in this dis- 
cussion. 

How reliable are students’ self-reports elicited through the interface? 

All in all, in this study we had far fewer self-reports of negative than of positive 
emotions. Of the 130 self-reports on the joy/distress pair, only 9 were for distress. Of the 
103 reports on the admiration/reproach pair, only 6 were for reproach. On the one hand, 
this could be taken as further evidence that user self-reports are unreliable because 
users often tend to give answers out of politeness. On the other hand, the reader 
should recall that in our study the experimenters were not trying to induce negative 
emotions; they were simply trying to provide help every time they thought a student 
was not learning well from the game. Our initial expectation was that these tutorial 
interventions would often be annoying because they would interfere with the game-like 
nature of the interaction. However, it may be that we underestimated the students’ 
desire to learn from the game or the novelty effect of interacting with an animated 
pedagogical agent. Furthermore, students may not have encountered many 
situations in which the game itself became annoying or frustrating. 

To gain a better understanding of how reliable our subjects’ self-reports are, we 
looked at the log files to identify those situations that did generate negative reports 
from some of the subjects, to see if and how often they appeared in conjunction with 
positive reports. A preliminary analysis shows that only 1 student (X1 in Table 5) 
gave both a negative and a positive report in response to the same, potentially negative 
situation. In a single sequence where the student is not making climbing progress, X1 
gave an admiration report of 5, followed by 1 (indicating reproach), followed again by 
5. This is the only instance we have been able to find in our log files of potentially 
inconsistent self-reports, providing evidence toward the hypothesis that in our study 
we did not get many negative self-reports not because of the subjects’ tendency to 
please the experimenter, but because they mostly did not experience negative 
emotions. This does not prove, of course, that our interface for emotion self-reports is 
generally reliable. More testing should be done with interactions that do induce more 
negative emotions. However, this analysis shows how log data can be used as an 
alternative to (or in conjunction with) video data to integrate and validate specific sets of 
emotion self-reports. 
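
A log check of this kind could be sketched in Python as follows. The log layout, the situation labels and the cut-offs for what counts as a negative (2 or lower) or positive (4 or higher) rating are assumptions made for illustration; only the 5, 1, 5 sequence for student X1 comes from the analysis above, and the second student is a made-up entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Report = Tuple[str, str, int]  # (student id, situation label, rating on a 1-5 scale)


def find_mixed_reports(log: Iterable[Report], low: int = 2,
                       high: int = 4) -> Dict[Tuple[str, str], List[int]]:
    """Return the (student, situation) groups holding both a negative and a positive rating."""
    by_situation: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for student, situation, rating in log:
        by_situation[(student, situation)].append(rating)
    return {
        key: ratings
        for key, ratings in by_situation.items()
        if min(ratings) <= low and max(ratings) >= high
    }


log = [
    ("X1", "no-climbing-progress", 5),  # sequence reported in the text
    ("X1", "no-climbing-progress", 1),
    ("X1", "no-climbing-progress", 5),
    ("S2", "fall", 1),                  # hypothetical entry: consistently negative
    ("S2", "fall", 2),
]
mixed = find_mixed_reports(log)
print(mixed)
```

Only the X1 group is flagged, since the hypothetical second student reported negative ratings throughout.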

6 Conclusions and Future Work 

In this paper, we have addressed the problem of how to evaluate affective user mod- 
els. Because affective user modeling is a relatively new research field, there is very 
little knowledge on how to best evaluate these models, especially if they try to assess 
a variety of specific emotions in fairly unconstrained interactions that tend to generate 
different affective reactions in different users. We have reviewed the techniques that 
can be used to validate an affective user model or the sources of data that it uses, and 
we have discussed the application of one of these techniques, direct model evaluation, 
to the validation of a model of student affect during the interaction with an educational 
game. Direct model evaluation is advantageous both because it does not require a 
complete system that uses the affective model and because it gives more precise in- 
formation on model performance and the factors that influence it. However, it poses 
the challenge of obtaining a measure of the user’s actual emotions during the interac- 
tion. We have illustrated a mechanism that we have devised to obtain this information 
as unobtrusively as possible, and we have presented an analysis aimed at understand- 
ing whether we have succeeded. In particular, we have shown how we tried to answer 
three questions that are key to defining the effectiveness of any mechanism for emo- 
tion self-report to directly evaluate affective models: (1) how intrusive the mechanism 
turns out to be; (2) what to do with the data from subjects that do find the mechanism 
intrusive; (3) how to assess the reliability of the obtained self-reports. Our answers are 
preliminary, and currently limited to the specific application and user population in- 
volved in the study presented here. However, as research in affective modeling pro- 
gresses, we hope that more and more of these answers will be provided through em- 
pirical model evaluations. This would help create a set of standards and guidelines 
that can streamline the evaluation process and allow researchers to adopt a specific 
evaluation method with a clear understanding of its possible sources of inaccuracies 
and related compensation strategies. 

Abstract. This paper reports the results of a preliminary cross-evaluation 
experiment run in the framework of the European research project PF-Star¹, 
with the double aim of evaluating the possibility of exchanging FAP data 
between the involved sites and assessing the adequacy of the emotional facial 
gestures performed by talking heads. The results provide initial insights into the 
way people belonging to various cultures react to natural and synthetic facial 
expressions produced in different cultural settings, and into the potential and 
limits of FAP data exchange. 



1 Introduction 

Analysis and synthesis of human-like facial expressions, in particular the synchronization 
of synthetic expressions with speech output, is receiving growing attention in 
the development of embodied conversational agents. 

Within the European project PF-Star a whole work package is dedicated to 
synthesis of facial expressions of emotions, and much effort has been spent to design 
a methodology for evaluation of the emotional facial gestures performed by 3D 
animated talking heads. Particular attention has been paid to harmonizing the 
infrastructure and facilitating the exchange of models and data between the involved 
sites. Exchanging data not only increases integration and cooperation possibilities, but 
also gives each data set a higher productivity, since it can be used at different sites, 
and the results compared, even in a cross-cultural perspective. 

The experiment we describe in this paper aimed at identifying crucial areas of 
synchronization of data and methods across sites, addressing issues related to the 
adequacy of the emotional facial expressions performed by talking heads and trying to 
detect cultural factors in their recognition. The experiment involved the Swedish and 
Italian partners. The design principles were inspired by Ahlberg, Pandzic and You’s 
[1] evaluation procedure for MPEG-4 facial animation players. They propose to 
measure the expressiveness of a synthetic face through the accuracy rate of human 
observers who recognise the facial expression, and to compare the expressions of the 
synthetic face with those of the “original” human face upon which they are based. 
Similarities and dissimilarities between their methodology and ours are discussed in 
detail in [2]. 

¹ Project website: http://pfstar.itc.it 

Section 2 of the paper illustrates how the test data were acquired and exchanged 
across sites. Section 3 describes the experiment, and Section 4 presents and discusses 
some preliminary results. 



2 Materials 

Preparation of data involved recording actors uttering a series of stimuli acted with 
emotions in different conditions, production and exchange of the related MPEG-4 
FAP (Facial Animation Parameters) files, and animation of the FAP sequences using 
different synthetic faces. 

Similar data acquisitions have taken place at both the involved sites, using 
opto-electronic systems able to capture the dynamics of emotional facial expressions with 
very high precision. The Swedish corpus was recorded using a four-camera Qualisys 
MacReflex system [3], capturing 35 markers at a rate of 60 frames per second. The 
Italian corpus was collected with the Elite system [4], which uses two cameras with a 
frame rate of 100 Hz to capture 28 markers. Both corpora include a common sub-set 
of data for cross-site comparisons, limited to 10 nonsense words (which have a very 
similar pronunciation in both Italian and Swedish) uttered 3 times each with 3 
emotions: neutral, angry and happy. The nonsense words recorded in Sweden were 
uttered in isolation, while those recorded in Italy were preceded by an opening word 
and followed by a closing one, which were uttered with neutral expression and were 
then cut out of the video files in order to reduce the differences with the Swedish 
stimuli. However, some differences could not be eliminated, because the Italian videos 
neither started nor finished with a rest position of the mouth. Both the 
recorded speakers were male actors in their thirties. 

Two synthetic 3D face models were used in the study, one originating from 
Sweden [5] and one from Italy [6]. The Swedish face, a male, is made up of 
approximately 1,500 polygons, whereas the Italian face is a textured young female 
built using about 25,000 polygons. Both models adhere to the MPEG-4 Facial 
Animation (FA) standard, which makes it possible to drive them from the same data. 
The FAPs are normalized according to the MPEG-4 FA standard, so that they are 
speaker-independent. The point trajectories obtained from the motion tracking 
systems described above were converted into FAP streams with custom-made 
software. The FAP streams were then used to animate both the synthetic faces. No 
adjustments to the synthetic face models were performed in order to drive them from 
the same FAP streams. 
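
The speaker independence mentioned above comes from expressing each displacement in face animation parameter units (FAPUs), which MPEG-4 defines as fractions (1/1024) of distances measured on the neutral face, such as the mouth width. The following Python sketch illustrates the idea for a single mouth-corner displacement. The face measurements and function names are hypothetical, this is not the custom conversion software used in the study, and the quantization that a real encoder applies to the value is omitted.

```python
def fapu(neutral_distance: float) -> float:
    """One FAP unit: 1/1024 of a reference distance on the neutral face."""
    return neutral_distance / 1024.0


def to_fap(displacement: float, neutral_distance: float) -> float:
    """Express a measured displacement in FAP units of the recorded speaker."""
    return displacement / fapu(neutral_distance)


def from_fap(fap_value: float, neutral_distance: float) -> float:
    """Turn a FAP value back into a displacement on a different face."""
    return fap_value * fapu(neutral_distance)


# Hypothetical measurements: the actor's neutral mouth width in millimetres
# and the synthetic model's neutral mouth width in its own model units.
actor_mouth_width = 50.0
model_mouth_width = 64.0

fap_value = to_fap(5.0, actor_mouth_width)              # actor's corner moved 5 mm
model_displacement = from_fap(fap_value, model_mouth_width)
print(fap_value, model_displacement)
```

In both faces the corner moves by the same proportion (10%) of the neutral mouth width, which is what allows one FAP stream to drive two different models.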

3 Experiment 

One group of Italian university students (47 volunteer students from the Dept. of 
Psychology of the University of Trieste) and one group of Swedish university 
students (30 volunteers from the Dept. of Linguistics of the University of Stockholm 
and the Dept. of Engineering of KTH) were presented with four blocks of 12 video-files 
each: 1) Italian actor, 2) Swedish actor, 3) Swedish synthetic face playing both 
Italian and Swedish FAP-files, and 4) Italian synthetic face playing both Italian and 
Swedish FAP-files, for a total of 48 stimuli per participant. Two nonsense words, 
ABBA and ADDA, uttered with three emotional states (happy, angry and neutral), 
were selected from the common sub-set of data. The stimuli were played without the 
audio. 

Before the experimental session the participants were given written instructions 
and took part in a short training session to familiarise themselves with the task. During the 
experimental session, the video-files were presented individually on the computer 
screen, in a randomised order. After each video-file was presented, the participants 
were asked to choose, on the answering sheet, among the three available labels for the 
emotional states. At the end of the experimental session, they were also asked to fill in 
a short questionnaire about their impressions concerning the faces. 



4 Results and Discussion 

The average percentages of correct recognition, reported in Table 1, show that both 
human faces got higher rates than synthetic faces. Responses from Italian and 
Swedish participants have been collapsed, since there were not any significant 
differences between them (see below). 



Table 1. Percentages of correct recognition for each emotion and condition. IT = Italian, SW = 
Swedish, ACT = actor, SYN = synthetic face. 

                IT ACT    SW ACT    IT SYN    IT SYN    SW SYN    SW SYN 
                                    IT-FAP    SW-FAP    IT-FAP    SW-FAP 
Angry           92%       81%       54%       23%       41%       66% 
Happy           67%       88%       84%       79%       41%       77% 
Neutral         68%       91%       71%       94%       71%       79% 
All emotions    76%       87%       70%       65%       51%       74% 
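
As a reading aid for Table 1, the short Python check below recomputes the "All emotions" row as the unweighted mean of the three per-emotion rates in each column. That the row was obtained this way is an assumption of this sketch (it holds if every emotion contributed the same number of stimuli), and the variable names are hypothetical.

```python
# Per-condition recognition rates (angry, happy, neutral) in percent, from Table 1
rates = {
    "IT ACT": (92, 67, 68),
    "SW ACT": (81, 88, 91),
    "IT SYN / IT-FAP": (54, 84, 71),
    "IT SYN / SW-FAP": (23, 79, 94),
    "SW SYN / IT-FAP": (41, 41, 71),
    "SW SYN / SW-FAP": (66, 77, 79),
}
# "All emotions" row as printed in Table 1
reported = {
    "IT ACT": 76, "SW ACT": 87, "IT SYN / IT-FAP": 70,
    "IT SYN / SW-FAP": 65, "SW SYN / IT-FAP": 51, "SW SYN / SW-FAP": 74,
}

recomputed = {cond: round(sum(r) / len(r)) for cond, r in rates.items()}
for cond in rates:
    print(f"{cond}: recomputed {recomputed[cond]}%, reported {reported[cond]}%")
```

The recomputed means agree with the reported row for all six conditions, and they make the pattern easy to see: each synthetic face scores lower when driven by the FAP files recorded at the other site.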



We performed two separate loglinear (multinomial logit) analyses [7] of the data with 
dichotomised responses (correct vs. wrong). In the first, the independent variables were 
the actor (Italian vs. Swedish), the presented emotion, and the subjects (Italian vs. 
Swedish). The results indicate (p<.01): an overall strong tendency towards correct 
responses; no effects (either main or interaction) of subjects on responses; and a 
significantly lower recognition rate for the Italian actor on all presented emotions, 
with the exception of anger. In the second loglinear analysis the independent 
variables were the subjects, the synthetic face (Italian vs. Swedish), the type of FAP 
files (Italian vs. Swedish) and the presented emotion. The results indicate (p<.01): a) 
an overall trend towards correct responses; b) that the Swedish FAPs negatively affect 
the Italian face on anger; c) that the Italian FAPs negatively affect the Swedish face 
on happiness. This suggests that even if exchanging FAPs is technically feasible, one 
should be careful in assuming that it does not have any consequences. 

As to comparison among the emotional states, there is a significant (p<.01) trend 
for happiness to be better recognised than anger. 

According to the post-session questionnaire, the Swedish actor was considered as 
the easiest to judge (68%), as well as more natural (54%) compared to the Italian 
actor (39%). The Swedish synthetic face was the hardest to judge and the Italian 
synthetic face was judged as the most pleasant (45%). 



5 Conclusions 

The results of this preliminary evaluation show that differences in the recognition of 
emotional expressions from synthetic agents seem to be due to the provenance of 
FAPs, while there appear to be no differences in the way participants, belonging to 
two different cultures (Italian vs. Swedish), react to natural and synthetic facial 
expressions produced in different cultural settings. Drawing clearer conclusions is not
possible at this point, since many factors related to cross-site differences in recording
conditions may have affected the results. Before planning further cross-evaluation
experiments, it is necessary to pay more attention to intervening factors, such as
differences in the recording conditions and differences in the synthetic face models.



1 Introduction

This paper describes the running first prototype (PT1) of the NICE Hans Christian
Andersen system. NICE stands for Natural Interactive Communication for Edutainment
(http://www.niceproject.com/). In this EU project (2002-2005), we aim to demonstrate
domain-oriented conversation, including 2D gesture input, with the life-like
animated fairytale author Hans Christian Andersen (HCA). In contrast to task-oriented
spoken dialogue [1], domain-oriented conversation has no task constraints.
The user can address, in any order, any topic within HCA’s knowledge domains,
using spontaneous speech and mixed-initiative dialogue. In PT1, the domains are:
HCA’s works, his life, his physical presence in his study, the user, and HCA’s role as
“gate-keeper” for access to the fairytale world, which is not described here. In
addition, HCA has a ‘meta’ domain for handling meta-communication during
conversation. HCA reacts emotionally to the user’s input, e.g. by getting angry or sad
because of what the user says, or happy if the user likes to talk about his fairytales. The
HCA system is not an information system. It attains its educational goal by providing
correct factual information, both visually and orally, but an equally important goal is
to entertain through human-like conversation, so that the target users, children and
teenagers aged 10-18, are pleased to have met someone of, and from, a different age
who is much more like themselves than expected.

Below, we present the HCA system architecture, focusing on general architecture 
and information flow, as well as NISLab’s natural language understanding, character
modelling, and response generation modules. 



2 General Architecture 

The HCA system’s event-driven, modular, asynchronous architecture is shown in
Figure 1. In addition to the modules explained in more detail below, the modules (and
their providers) are: speech recogniser (Scansoft, not in PT1); gesture recognition
(freeware); gesture interpretation, input fusion (LIMSI, no semantic fusion in PT1);
speech synthesis (Scansoft), including time calculation for animation tags; and
animation, including character animation and virtual world simulation (Liquid Media).
The modules communicate via a central message broker, publicly available from KTH at
http://www.speech.kth.se/broker.

E. André et al. (Eds.): ADS 2004, LNAI 3068, pp. 305-308, 2004.

The broker is a server which routes function calls, results and error codes between
modules. The Transmission Control Protocol (TCP) is used for communication. The
broker coordinates input and output events by time-stamping all module messages and
associating them with a certain conversation turn. The behaviour of the broker is
controlled by message-passing rules, specifying how to react when receiving a message
of a certain type from one of the modules.
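The routing idea can be illustrated with a small in-process sketch. It is not the KTH
broker and does not use TCP; the class names, the rule table and the message types are
hypothetical and only show how time-stamping, turn association and type-based
message-passing rules might fit together.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    msg_type: str
    payload: object
    turn: int = 0          # filled in by the broker
    timestamp: float = 0.0  # filled in by the broker


class Broker:
    """Toy message broker: stamps messages and routes them by type."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._modules = {}  # module name -> handler callable
        self._rules = {}    # message type -> list of recipient module names
        self.turn = 0

    def register(self, name, handler):
        self._modules[name] = handler

    def add_rule(self, msg_type, recipients):
        self._rules[msg_type] = list(recipients)

    def new_turn(self):
        self.turn += 1

    def publish(self, message):
        # Time-stamp the message and associate it with the current turn.
        message.timestamp = self._clock()
        message.turn = self.turn
        delivered = []
        # Apply the message-passing rule for this message type, if any.
        for name in self._rules.get(message.msg_type, []):
            self._modules[name](message)
            delivered.append(name)
        return delivered
```

A module would register a handler under its name, and a rule such as
`add_rule("speech_hypotheses", ["nlu"])` would forward every message of that
(hypothetical) type to the natural language understanding module.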




Fig. 1. General NICE HCA system architecture. 

In terms of information flow, the speech recogniser sends an n-best set of hypotheses
(PT2) to natural language understanding, which sends a 1-best hypothesis to input
fusion. Similarly, the gesture recogniser sends an n-best hypothesis set to the gesture
interpreter, which consults the animation module as to which object the user may have
indicated. In PT1, the input fusion module simply forwards an n-best list of pairs of
(recognised pointable object + gesture confidence score) from the gesture interpreter
and/or a 1-best natural language understanding output to the character module, which
takes care of input fusion when required. The character module sends a coordinated
verbal/non-verbal output specification to the response generator, which splits the
output into synchronised text-to-speech and animation. Synchronisation is handled by the
animation module. For comparison, see, e.g., the architectures in [2].

3 Language Understanding, Character Module, Response 
Generation 

The natural language understanding (NLU) module manager (Figure 2) manages in- 
ternal NLU communication. Each domain has a set of keyphrases. The keyphrase 
spotter spots phrases in the user utterance and converts them into syntactic/semantic 
categories. The output is passed on to the syntactic analyser which consists of a num- 
ber spotter, a lexicon and a rule engine. The number spotter spots numbers in the in- 
put, indicating, e.g., the user’s age. The lexicon entries consist of syntactic/semantic 
categories for individual words. After passing through the number spotter and lexicon, 
the user input is a sequence of semantic and syntactic categories. The rule engine then 
applies rules defined on the presence of certain semantic/syntactic categories at spe- 
cific positions in the sequence. The domain/topic spotter spots the input topic(s) by 
mapping the semantic/syntactic categories to their respective topics. The mapping is
defined at design time. Domains are identified based on topics. The result is sent to
the FSA (Finite State Automaton) processor which acts as the deepest level of par- 
sing. If the user sequence is able to traverse an FSA, the result corresponding to that 
FSA is the NLU output semantics. The FSAs are developed off-line from a training 
corpus. The result consisting of domain(s), topic(s) and semantics is sent to the input 
fusion module which forwards the result to the character module. 
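The last two steps, turning the input into a category sequence and letting it traverse
an FSA, can be sketched as follows. The lexicon, the category names, the two automata
and the semantic labels are invented for illustration; the actual NICE lexicon and FSAs
are not given in this paper.

```python
# Hypothetical lexicon: word -> syntactic/semantic category.
LEXICON = {
    "how": "WH", "old": "AGE_ADJ", "are": "BE", "am": "BE",
    "you": "PRON_HCA", "i": "PRON_USER",
}


def categorise(utterance):
    """Number spotter plus lexicon lookup: words -> category sequence."""
    categories = []
    for token in utterance.lower().split():
        token = token.strip(".,!?")
        if token.isdigit():
            categories.append("NUMBER")
        elif token in LEXICON:
            categories.append(LEXICON[token])
    return categories


# Each FSA: transitions {(state, category): next_state}, final states, semantics.
FSAS = [
    {"transitions": {(0, "WH"): 1, (1, "AGE_ADJ"): 2, (2, "BE"): 3, (3, "PRON_HCA"): 4},
     "final": {4}, "semantics": "ask_hca_age"},
    {"transitions": {(0, "PRON_USER"): 1, (1, "BE"): 2, (2, "NUMBER"): 3},
     "final": {3}, "semantics": "tell_user_age"},
]


def traverse(fsa, categories):
    """Return True if the category sequence drives the FSA into a final state."""
    state = 0
    for category in categories:
        key = (state, category)
        if key not in fsa["transitions"]:
            return False
        state = fsa["transitions"][key]
    return state in fsa["final"]


def parse(utterance):
    """Return the semantics of the first FSA the input can traverse, else None."""
    categories = categorise(utterance)
    for fsa in FSAS:
        if traverse(fsa, categories):
            return fsa["semantics"]
    return None
```

In this sketch an utterance that no FSA accepts yields `None`; how the real system
handles such input is not described here.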




Fig. 2. NICE HCA natural language understanding module. 

The HCA character module (Figure 3) is managed by the character module manager 
which also takes care of module-external communication. Viewed as a whole, the 
character module is in one of three output states, producing either: non-communica- 
tive action output, communicative function output, or communicative action. Non- 
communicative action (NCA) output is produced when nobody is talking to HCA. In 
this state, he is simply doing his work in his study. Communicative function (CF) 
output is produced when someone is talking and/or gesturing to HCA, to which he 
responds by showing awareness of the user’s input. For this to happen in real time, the 
character module has fast-track connections to the speech and gesture recognisers in 
order to act as soon as one of them receives input. Communicative action (CA) output 
is HCA’s conversational contributions. The overall state relationships are: NCA -> 
CF <-> CA -> NCA. Thus, NCA, the system’s “resting state”, must be followed by a 
CF state in which a new user starts addressing HCA. Following the user’s first con- 
versational contribution, conversation in which user and system take turns (CF <-> 
CA) may go on for a while, eventually being followed by the NCA state. 
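The state relationships NCA -> CF <-> CA -> NCA can be written down as a small
transition table. The sketch below is only an illustration of these four transitions;
the class and method names are hypothetical.

```python
# Allowed transitions between the three output states of the character module.
TRANSITIONS = {
    "NCA": {"CF"},        # resting state: a user starts addressing HCA
    "CF": {"CA"},         # HCA showed awareness, now contributes
    "CA": {"CF", "NCA"},  # user takes the next turn, or the conversation ends
}


class OutputStateMachine:
    def __init__(self):
        self.state = "NCA"  # the system's resting state

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        return self.state
```

For example, a conversation with two user turns would pass through CF, CA, CF, CA
and then return to NCA, whereas a direct step from NCA to CA is rejected.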

The mind-state agent (MSA, Figure 3) manages the user’s spoken and/or gesture 
input including the planning of which response (or communicative action) to produce 
to the input. The central module is the MSA Manager (MSAM) which manages the 
other components of the MSA. Based on proposals from the conversation intention 
planner which embodies HCA’s conversational agenda, the MSAM decides whether 
to reply to the user’s input and/or whether to take the initiative in the conversation. 
The MSAM contacts the relevant domain agents (DAs) to get a reply and/or a dialo- 
gue continuation. For replies, the knowledge base (KB) is always contacted directly. 
For continuations, the KB is contacted directly unless the proposed output is a mini- 
dialogue, i.e. a predefined small dialogue, in which case the mini-dialogue processor 
(MDP) is contacted first. The MDP processes mini-dialogues in a finite state-machine 
approach. The KB is a database which maintains the system’s ontology including 
references to all of HCA’s coordinated spoken and non-verbal output. The retrieved 
output references are sent to response generation via the MSAM and the character 
module manager.




Fig. 3. HCA character module architecture. 

The emotion calculator updates HCA’s emotional state whenever the user’s input 
produces an emotion increment which makes HCA more happy, sad, or angry. The 
user model stores the information which HCA collects about the present user, i.e. age, 
gender and nationality, for use during conversation. The conversation history includes 
a comprehensive record of the conversation per input and output turn. 
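A minimal sketch of such an emotion calculator is given below. The paper does not
specify value ranges, increment sizes or how a dominant emotion is chosen, so the
bounded scale from 0 to 10, the "neutral" fallback and all names are assumptions made
for illustration.

```python
class EmotionCalculator:
    """Keeps one bounded intensity value per emotion (assumed design)."""

    EMOTIONS = ("happy", "sad", "angry")

    def __init__(self, lower=0.0, upper=10.0):
        self.lower = lower
        self.upper = upper
        self.state = {emotion: lower for emotion in self.EMOTIONS}

    def apply(self, emotion, increment):
        """Add an emotion increment caused by the user's input, clamped to range."""
        if emotion not in self.state:
            raise KeyError(emotion)
        value = self.state[emotion] + increment
        self.state[emotion] = max(self.lower, min(self.upper, value))
        return self.state[emotion]

    def dominant(self):
        """Strongest current emotion, or 'neutral' if nothing has been triggered."""
        best = max(self.EMOTIONS, key=lambda emotion: self.state[emotion])
        return best if self.state[best] > self.lower else "neutral"
```

With these assumptions, repeated insults would push "angry" towards the upper bound,
while a fresh conversation starts out as "neutral".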

The response generator receives a parameterised semantic instruction composed of 
input values, text-to-speech references and/or references to non-verbal behaviours. 
The TTS references are used to retrieve text template output with embedded start and 
end tags for non-verbal behaviours (bookmarks). Input values are inserted into the 
templates, creating a surface language string. The result is sent to the speech synthe- 
siser which synthesises the verbal output and, whenever it meets a bookmark, sends a 
message to the response generator that now the corresponding non-verbal output des- 
criptions must be sent to the animation module which takes care of the graphics out- 
put. The first NICE HCA prototype uses approx. 300 spoken utterance types and 100 
different non-verbal behaviour primitives. 
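The template mechanism can be sketched as follows. The bookmark syntax `[bm:...]`,
the template text and the function names are hypothetical; the sketch only shows input
values being inserted into a template and a callback being triggered whenever the
"synthesiser" meets a bookmark.

```python
import re
from string import Template

# Hypothetical text template with embedded start/end bookmarks.
TEMPLATES = {
    "greet_user": "[bm:smile_start]Welcome, $name.[bm:smile_end] "
                  "So you are $age years old.",
}
BOOKMARK = re.compile(r"\[bm:([a-z_]+)\]")


def generate(tts_ref, values):
    """Insert the input values into the referenced template."""
    return Template(TEMPLATES[tts_ref]).substitute(values)


def synthesise(surface, on_bookmark):
    """Stand-in for the synthesiser: returns the spoken text and reports
    every bookmark it meets, in order, via the callback."""
    spoken = []
    position = 0
    for match in BOOKMARK.finditer(surface):
        spoken.append(surface[position:match.start()])
        on_bookmark(match.group(1))
        position = match.end()
    spoken.append(surface[position:])
    return "".join(spoken)
```

In the real system the callback would correspond to the message sent back to the
response generator, which then forwards the matching non-verbal output description to
the animation module.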

The promising results from the January 2004 user tests will be reported elsewhere. 

Acknowledgement. The work presented is being supported by the EU Human 
Language Technologies programme under contract IST-2001-35293. We gratefully 
acknowledge the support.



Abstract. With COSIMA B2B we demonstrate a prototype of a complex and visionary
e-procurement application. The embodied character agent named COSIMA is able to
respect a customer's preferences and deals with him or her via natural speech. She
expresses various emotions via mimics and gestures combined with speech synthesis,
and COSIMA is even able to consider the customer's emotions via mimic recognition.
As first observations show, this is a very promising approach to improving bargaining
with the customer and the recommendation of products.



1 Introduction 

Intelligent computers which understand the user still seem like pie in the sky. In order
to understand the user we need intuitive user interfaces and, behind them, intelligent
technology which is able to act in a situated and personalized way.

The complex and visionary e-procurement application COSIMA B2B is the result of
an interdisciplinary effort combining research on emotion detection via mimic and on
intelligent databases with technology from the fields of natural language
communication and embodied character agents. In cooperation with three industrial
partners (seller side: SSI Schäfer¹; buyer side: MAN Roland Druckmaschinen AG²;
content provider: Fachverlag Walch³) we modeled a realistic use case scenario
which takes the involved emotions into consideration. Equipped with real product
data, based on the IT product standards BMEcat and eCl@ss⁴, our emotional sales
agent COSIMA was exhibited at the computer fair Systems 2003⁵ in Munich. There
we recorded a video of each visitor for a sociological analysis, so that besides
demonstrating COSIMA B2B we are now able to report experiences on the user
acceptance of a computer system with emotional activity in a real-world application.



¹ www.ssi-schaefer.com
² www.man-roland.com
³ www.walch.de
⁴ www.bmecat.org and www.eclass.de
⁵ www.systems-world.com



2 COSIMA B2B - An Emotional Sales Agent for E-procurement

COSIMA B2B is a prototype of an emotional sales agent named COSIMA which is able
to automate a cost-intensive e-procurement process. We give a short tour of the
prototype's functionality and briefly describe the technologies involved.

COSIMA welcomes a customer by name and helps him when putting together a 
shopping cart of desired products by respecting the customer’s explicit search prefer- 
ences and also his long-term search preferences. The customer can communicate via 
graphical interface, natural language keyboard input, or via microphone. We intui- 
tively model preferences in an “I like A better than B” semantics as strict partial or- 
ders ([5]), which have been proven to be a very suitable basis for efficient usage of 
preferences within database search engines in various e-business applications ([6]). 
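The "I like A better than B" semantics can be illustrated with a small sketch that
selects the best matches from a set of candidates, i.e. those candidates that no other
candidate is preferred to. This is only an illustration of strict partial orders; it is
not the preference engine of [5] or [6], and the colour preferences are invented.

```python
def transitive_closure(pairs):
    """Close a set of (better, worse) pairs under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def best_matches(candidates, pairs):
    """Return the candidates that are not dominated by any other candidate."""
    closure = transitive_closure(pairs)
    if any(a == b for a, b in closure):
        # A cycle would violate irreflexivity: not a strict partial order.
        raise ValueError("preferences contain a cycle")
    return [c for c in candidates
            if not any((other, c) in closure for other in candidates)]


# Hypothetical customer preference: blue better than green, green better than red.
PREFERENCES = {("blue", "green"), ("green", "red")}
```

Note that "yellow", about which the customer has stated nothing, is neither better nor
worse than the other colours and therefore remains among the best matches.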

Depending on the content of the shopping cart and the role of the customer,
COSIMA grants personalized discounts and bargains over the price, using techniques
such as up-, down- and cross-selling. A first prototype using our preference-based
multi-objective bargaining component was presented at AAMAS 2002 ([3]).
Since then, our preference-based bargaining component has been enhanced, embedded
in our e-procurement scenario and combined with emotion recognition via mimic. As
described in [1], using emotions is a promising way to improve human-computer
interaction. As illustrated in Fig. 1, COSIMA takes the customer’s emotion into account
during the bargaining process and is therefore able to react to his or her emotions.




Fig. 1. The customer is smiling at the end of a successful bargaining 

Of course COSIMA can be pleased by the customer’s reaction, but she can also
become very angry when her opponent tries to cheat her. COSIMA shows her
emotions via various mimics and gestures in combination with speech synthesis, e.g.
"You make me angry when you are so stingy!". At the end of the dialog, COSIMA
waves goodbye to the customer and thanks the customer for the purchase or for the
interest in her offers.

We modeled the recognized customer emotions by means of Ekman’s six basic
emotions ([2]). Moreover, we detect two further conditions, “neutral” and “turning
away” (Fig. 2), where the latter implies that the customer might be disinterested or
distracted. Most commonly used approaches for visually detecting emotions are data
driven and do not incorporate any model of the extracted information at all. This does
not result in high robustness regarding the tracking of the customer’s head and has no
value for related tasks that could make use of the head’s pose and deformation. Other
approaches additionally incorporate the facial texture and 3D information, which
improves the tracking quality but loses real-time capability.




Fig. 2. Detected customer mimics “neutral” and “turned away” 



Making use of a point distribution model we created a face outline model with high
performance. Its tracking is based on multiple, redundant techniques. Since each of
them has particular strengths with respect to certain challenges, our approach turns
out to be very robust, e.g. to different persons, various backgrounds, and background
motion. Using optical flow, based on OpenCV, our system can cope with fast
head movements, whereas our novel approach of intensity profiles helps us to extract
human face contours with high accuracy.

Furthermore, our system autonomously detects persons and starts tracking their 
heads without any human intervention. We achieve high accuracy using a boosted 
cascade of Haar-like features. Having tracked a human face successfully, our system
is able to extract the parameters describing the facial deformation. By using those 
parameters a previously learned classifier detects the current mimic. 

The communication between the mimic recognition and the J2EE-based
e-procurement application is managed via an agent-based FIPA-OS platform. The
source code now comprises more than 100,000 lines.



3 Experiences, Summary, and Outlook 

Here, we focus on our experiences with the newly integrated mimic recognition com- 
ponent. For experiences with a speaking embodied character agent see our work [3]. 

When COSIMA recognized, e.g., a laughing face of a customer, she told him or her
"You seem to be laughing". Most customers were surprised and wanted to know how
COSIMA could know. They liked this feature and told us that COSIMA seems to be
interactive. It should be mentioned, however, that some people did not like the idea of
being observed by a computer. The main problem of the mimic recognition was that
COSIMA did not always utter her comments within the context of the conversation. The
agent sometimes told customers that they were laughing although they were not.
In general, customers ignored the unfitting comments of the mimic recognition
technology. This might be due to the fact that COSIMA was considered a prototype
system, as some customers explained later. But when the agent repeated the sentence
several times, most customers became annoyed; e.g. one person declared angrily "I am
NOT laughing". This example shows that mimic recognition is highly sensitive to the
context of conversation for human interaction ([4]). How to place mimic recognition
within the communication with emotional embodied agents should therefore be
considered carefully. On the one hand, the mimic recognition of an agent arouses high
expectations on the side of the customer, who thinks the agent is more interactive and
seems to follow a strategy. On the other hand, the realization that the mimic recognition
makes no sense within the situation may cause disappointment or negative feelings.

OpenCV: www.intel.com/research/mrl/research/opencv

In summary, for our novel interplay of e-procurement, personalization, and
human-computer interaction technology we received much encouragement as well as
incentives for further improvements and extensions. Thus, we intend to add
emotion recognition via speech input and also emotional speech synthesis. We are
already working on gesture recognition. A more powerful situation model will also be
necessary in order to handle such complex processes adequately.

With the appropriate hardware, e.g. screens equipped with cameras and microphones,
such interplay in business applications is no longer just a vision and could be very
helpful for both parties, here e.g. the seller and the sales prospect.

Acknowledgments. This work is supported by the Bavarian Research Association 
FORSIP and by the German Research Foundation DFG. 

Avatar: copyright 2000 character design DFKI Saarbrücken, Germany.



Abstract. Affective computing systems consist of a varying number of components,
such as sensors, databases, and processing devices. So far, each system
is set up specifically, and its architecture and configuration are as unique as its
targeted purpose. With the increasing interest in affective computing, a generally
applicable and easily realizable approach is desired to ease and speed up
the installation process of such systems. In this paper, a Java-based framework
is presented which uses an adaptor-oriented approach in order to address this
problem from a data perspective. An example is given for the application of this
proposed framework within an affective computing system.



1 Introduction 

Establishing an experimental system for acquisition and analysis of affect related data 
is a challenging task. Setting up several sensors, collecting data across diverse plat- 
forms and converting them into various formats demanded by different analysis tools 
requires a lot of resources, manpower and time. New scenarios require new sensor 
choices, different configurations of analysis tools and altered data flows. 

In this paper, we present a framework that supports experimental set-ups for
collection, analysis, and evaluation of emotion-related data, and which is easy to
configure, to extend and to migrate to different environments. It allows the integration
of arbitrary sensors, data sources and target applications.

In the following sections, we first discuss an affective computing system from an
integrative perspective and derive requirements for an easily configurable and
extendable sensor system. Following this, we briefly describe a concept for the
realization of such an affective computing system as well as a sample implementation.
Finally, a picture of our current work and a short outlook on the next steps are given.



2 Motivation and Goal 

Affective computing systems consist of a number of highly dedicated components 
which, in most of the cases, were developed as independent stand-alone applications. 



E. Andre et al. (Eds.): ADS 2004, LNAI 3068, pp. 313-318, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 







S. Mader et al. 



Those components are usually designed around the task they are targeting. Decisions
on the platform, the programming language to use, and the ports and interfaces to be
provided are made mainly with the targeted problem in focus. Other considerations, like
networking capabilities or compatibility with systems or applications not directly
linked with the particular task, were set aside or neglected altogether.

Integrating such components into a complex system is a challenging task: the com- 
ponents provide a limited number of different interfaces; each component delivers or 
expects data in dedicated native formats; there is a huge variety in the data concerning 
their complexity, continuity, reliability, size, and transmission speed. From
the data perspective, a general approach to setting up such systems has to:

• present and describe all kinds of data in a uniform way, so that they can be handled 
by the system in a similar way; 

• integrate the data of different sources by combining them in a sensible way; 

• store and forward the data in a standardized way, preserving their time order. 

As a solution, we propose a framework which can be considered as a system con- 
struction kit, where diverse components are wrapped into smoothly fitting building 
blocks that can be freely arranged and allow for a wide range of experiments. Our 
framework provides the infrastructure for connecting the components, an easy to use 
configuration mechanism, and unified, real-time capable data handling schemes. This 
makes the proposed framework also suitable as basis for modular, commercial solu- 
tions. 



3 Concept 

Existing components for affective computing systems have often been developed
independently as stand-alone applications on different platforms. To allow them to be
integrated smoothly into the framework, they are extended by adaptors which provide
the required uniformity in data representation. Moreover, the adaptors are designed to
allow different components to be handled alike. In their entirety, these adaptors
form an additional abstraction layer between the data sources and processing
applications. This abstraction layer accomplishes three tasks:

1. Abstract the data from their representation, i.e. usage of an intermediate data
representation for data transmission and manipulation in order to be independent of
the actual implementation details of data sources and data sinks;

2. Abstract the data flow from its source and destination, i.e. build up complex data 
conversion and manipulation pipelines which can be re-used with different sensor 
configurations, storage- and analysis tools or end-user applications; 

3. Abstract data sources from data destinations, i.e. the actual source of the data will 
be transparent to the data destination. 

In order to support the above abstraction goals and the distribution of the data through 
the entire framework, the abstraction layer has to be able to support basic low-level 
operations on the data. These operations may also include recurring tasks like scaling, 
biasing, smoothing or re-sampling of the data. For additional flexibility, we require 
the framework to be configurable and extendable at runtime. 
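The recurring low-level operations mentioned above can be sketched as plain functions
on a list of numeric samples. The concrete definitions (a trailing moving average for
smoothing, simple decimation for re-sampling) are assumptions chosen for brevity; the
paper does not prescribe particular algorithms.

```python
def scale(samples, factor):
    """Multiply every sample by a constant factor."""
    return [s * factor for s in samples]


def bias(samples, offset):
    """Add a constant offset to every sample."""
    return [s + offset for s in samples]


def smooth(samples, window):
    """Trailing moving average over the last `window` samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1):i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def resample(samples, step):
    """Keep every `step`-th sample (simple decimation)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return samples[::step]
```

For instance, `smooth(scale([2, 4, 6, 8], 0.5), 2)` first halves the samples and then
averages each sample with its predecessor.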

Behmaram-Mosavat and Encarnação [1] proposed a software framework for
user-centered multi-modal interaction. Their framework provides multiple layers of
abstraction which, besides device abstraction, are mainly focused on the support of
high-level interaction schemes employing gestures, speech, and intuitive combinations
of input modalities. In contrast to their approach, we limit our framework to the
transmission and distribution of quantitative data at a rather low semantic level. This
simplifies the framework architecture and leaves the complexity of high-level
processing inside the relevant components.

In 2001, Reitmayr and Schmalstieg [2] introduced the OpenTracker framework,
which has since been widely used in the fields of Virtual and Augmented Reality
(AR/VR). Besides abstracting AR/VR applications from arbitrary position
tracking devices, the OpenTracker framework also enables low-level pre-processing
and fusion of data from different sources. However, OpenTracker is restricted to
handling a rigid data format describing position, orientation and button states of
3D tracking devices, whereas our framework extends the OpenTracker concept to
arbitrary and generic data formats.

Similar to OpenTracker, we choose an event-oriented approach, i.e. the data are
handled in the form of uniform Events, and the components of the framework are
classified according to their roles with respect to the data into Sources, Filters and
Sinks. Sources are the entry points of data into the framework. Each is associated with
a data-producing entity (e.g. a sensor) and uses native drivers or open interfaces for
configuration and data handling. Data received from the sensor are converted into
Events and handed over to the framework. Besides the actual data, each Event contains
an individual timestamp which records the time at which the data entered the
framework. Sinks are the exit points of the framework. Each is associated with a
data-processing entity (e.g. an analysis tool) and implements an interface to it. Their
purpose is to receive the Events from the framework, unwrap the contained data, convert
them into the native format of the associated entity and pass them to that entity. The
interfacing of Sources and Sinks with their associated entities is done via native
drivers, provided APIs, existing communication interfaces (e.g. network, serial) or by
direct integration in the case of self-developed entities. Filters perform supporting
low-level operations on the transferred data, such as pre-filtering and masking
operations, decomposition and fusion of data streams, statistical operations, etc.
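The roles of Events, Sources, Filters and Sinks can be sketched in a few lines. The
framework itself is implemented in Java; the Python classes below are a simplified,
hypothetical illustration of the concept and do not reproduce the OmniRoute API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    data: object
    timestamp: float  # time at which the data entered the framework


class Node:
    """Common wiring: every node forwards Events to its connected successors."""

    def __init__(self):
        self._successors = []

    def connect(self, node):
        self._successors.append(node)
        return node  # allows chaining: source.connect(f).connect(sink)

    def _forward(self, event):
        for node in self._successors:
            node.receive(event)


class Source(Node):
    def __init__(self, clock=time.time):
        super().__init__()
        self._clock = clock

    def emit(self, native_data):
        # Wrap native sensor data into an Event and stamp its entry time.
        self._forward(Event(native_data, self._clock()))


class Filter(Node):
    def __init__(self, func):
        super().__init__()
        self._func = func

    def receive(self, event):
        # Transform the data but keep the original entry timestamp.
        self._forward(Event(self._func(event.data), event.timestamp))


class Sink:
    def __init__(self, convert=lambda data: data):
        self._convert = convert
        self.received = []

    def receive(self, event):
        # Unwrap the data and convert it to the consumer's native format.
        self.received.append((event.timestamp, self._convert(event.data)))
```

A chain such as `source.connect(Filter(lambda x: x * 2)).connect(sink)` then routes
every emitted value through the filter to the sink.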

Fig. 1. Two examples of event processing configurations. Left: Session data from diverse
sensors (e.g. blood pressure, gaze point, facial mimic) are fused in order to record the
session in a database. Right: The previously recorded session is replayed, and the data
are processed with a meaningful statistical function in order to perform a run-time
analysis.

Sources, Filters and Sinks form the nodes of a data flow graph (see Fig. 1). The
nodes are connected by directional edges which represent the routing of the data from
their Source to the Sink. Along the graph, multiple filter nodes can be concatenated,
thus allowing the construction of powerful filter chains out of simple basic filters.
When passing through those filter chains, the data are manipulated according to the
filters’ functionality and parameters. The shape of the data flow graph is freely
configurable from the set of existing source, filter and sink modules.

Apart from specialized modules, which work on dedicated data formats, modules
capable of working on generic data formats will be provided to perform general tasks,
such as console monitoring, database storage and retrieval, or network distribution. In
order to allow this, all used data types have to be declared and named using a formal 
description. The description scheme is limited to a representative subset of basic data 
types and a simple structuring mechanism. Once described, the data formats may be 
used by Sources, Filters and Sinks. 



4 Implementation 

A prototypical system for affective user studies, which is under development at the
Fraunhofer Institute for Computer Graphics in Rostock, implements the described
concept. It consists of the OmniRoute framework and several components for data
acquisition, documentation and analysis. These include the face tracking and mimic
analysis platform FaceOme, a sensor board collecting ANS¹ related physiological
parameters, and our RealEYES system for tracking gaze and mouse/keyboard
data. All experiments can be documented with our logging tool EmoLogger. For the
analysis of the recorded data sets we currently use data mining methods from the
WEKA toolkit [3].

4.1 The OmniRoute Framework 

Our OmniRoute framework has been implemented using the Java programming lan- 
guage. It has been developed on top of the Apache Avalon/Excalibur framework 
which provides an easy to use mechanism to extend the framework with new source, 
sink and filter modules, and to introduce new data formats. Therefore, sources, filters, 
sinks, and data formats have been implemented in form of Avalon Components. 
Based on the extension mechanisms of Avalon/Excalibur, new components and data 
formats are announced to the framework via XML-based description files and are 
immediately available without the need of re-building the framework itself. 

In addition to dedicated source and sink modules, the framework provides a num- 
ber of auxiliary modules such as a sink module providing formatted console output 
for monitoring purposes or source and sink modules for the transmission of arbitrary 
data through a network connection. The latter two allow the framework to span
multiple hosts, e.g. to set up a distributed system integrating different hardware
platforms.



* ANS - Autonomous Nervous System 
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Set-up and configuration of the data flow graph can be done on three ways: at 
compile time by hard-coding the graph into the application logic, at runtime by load- 
ing an XML-file containing the configuration, or, dynamically during runtime, by 
accessing the data flow graph directly. The syntax of the XML-configuration file and 
the construction rules for the graph roughly follow the concepts described in [1]. 
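
As an illustration of the second option, the sketch below loads a data flow graph from an XML description at runtime. The element and attribute names (`graph`, `node`, `edge`, `id`, `type`, `from`, `to`) are hypothetical; the actual OmniRoute configuration syntax follows [1] and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file content.
CONFIG = """
<graph>
  <node id="board" type="source"/>
  <node id="extract" type="filter"/>
  <node id="logger" type="sink"/>
  <edge from="board" to="extract"/>
  <edge from="extract" to="logger"/>
</graph>
"""


def load_graph(xml_text):
    """Return (node types by id, adjacency lists) for the hypothetical format."""
    root = ET.fromstring(xml_text)
    nodes = {n.get("id"): n.get("type") for n in root.findall("node")}
    edges = {node_id: [] for node_id in nodes}
    for edge in root.findall("edge"):
        src, dst = edge.get("from"), edge.get("to")
        if src not in nodes or dst not in nodes:
            raise ValueError(f"edge refers to unknown node: {src} -> {dst}")
        # Simple construction rule: sinks have no outputs, sources no inputs.
        if nodes[src] == "sink" or nodes[dst] == "source":
            raise ValueError(f"invalid direction: {src} -> {dst}")
        edges[src].append(dst)
    return nodes, edges


nodes, edges = load_graph(CONFIG)
```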

In order to allow the handling of generic data formats throughout the framework, 
we also use an XML description mechanism. For simplicity, the developed format 
description scheme is limited to a representative subset of basic data types and a sim- 
ple structuring mechanism. These format descriptions form the basis for the construc- 
tion of efficient data structures that will be instantiated during system initialization 
and may be used throughout the system at runtime. 
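
A rough sketch of how a format description could be turned into a compact data structure at initialization time is given below. The XML vocabulary, the type names and the `RecordFormat` class are assumptions for illustration only; the paper does not specify the description scheme.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical format description for one sensor board sample.
FORMAT_XML = """
<format name="EmoBoardSample">
  <field name="timestamp_ms" type="uint32"/>
  <field name="skin_conductance" type="float32"/>
  <field name="skin_temperature" type="float32"/>
  <field name="heart_rate" type="uint16"/>
</format>
"""

# Mapping from the assumed basic type names to struct format codes.
TYPE_CODES = {"uint16": "H", "uint32": "I", "int32": "i",
              "float32": "f", "float64": "d"}


class RecordFormat:
    """Binary layout built once from a format description."""

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        fields = root.findall("field")
        self.name = root.get("name")
        self.fields = [f.get("name") for f in fields]
        codes = "".join(TYPE_CODES[f.get("type")] for f in fields)
        self.layout = struct.Struct("<" + codes)  # little-endian, no padding

    def pack(self, **values):
        return self.layout.pack(*(values[name] for name in self.fields))

    def unpack(self, data):
        return dict(zip(self.fields, self.layout.unpack(data)))


fmt = RecordFormat(FORMAT_XML)
blob = fmt.pack(timestamp_ms=1000, skin_conductance=2.5,
                skin_temperature=33.0, heart_rate=72)
sample = fmt.unpack(blob)
```

Parsing the description once and reusing the compiled layout keeps the per-event cost low, which is the point of instantiating the structures during system initialization.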



4.2 Components 
Data Sources

Our Face@me system (formerly FAMIA [2], [5]) monitors and analyzes the facial
expressions of the user. Once a face is recognized, characteristic facial features
are located and their positions relative to each other are continuously determined
using template matching with normalized cross-correlation. The positional
information about facial feature points (such as the eyebrows, eye corners, and the
mouth) is sent out through a FaceMeSource module.
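
Template matching with normalized cross-correlation can be sketched as follows. This is a generic, unoptimized illustration on small grey-value grids, not the Face@me implementation; the function names and the toy image are invented for the example.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized 2D grey-value lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:  # flat patch or template: correlation is undefined
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(image, template):
    """Slide the template over the image; return (score, row, col) of the best fit."""
    th, tw = len(template), len(template[0])
    best = (-2.0, -1, -1)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, r, c)
    return best


image = [
    [10, 10, 10, 10, 10],
    [10, 20, 80, 10, 10],
    [10, 90, 30, 10, 10],
    [10, 10, 10, 10, 10],
]
# Same pattern as at row 1, column 1 of the image, but twice as bright.
template = [[40, 160], [180, 60]]
score, row, col = best_match(image, template)
```

Because the mean is subtracted and the result is divided by both standard deviations, the score is insensitive to uniform changes in brightness and contrast, which is why the brighter template still matches.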

RealEYES [6] is our system for software usability tests. As one key feature, it
synchronously records multiple channels of the user’s interaction with the tested
scenario. Besides mouse and keyboard events, it also records verbal utterances,
facial expressions, the gaze position of the user, and all activities on the
computer screen. Mouse and keyboard data, as well as the current gaze position, are
sent to the framework via the RealEYESSource module.

EmoBoard: For our studies on emotions in humans who are interacting with computer
systems we developed a small, portable sensor board which collects physiological
data of the user, namely the skin conductance level, the skin temperature, and the
heart rate. An FPGA^ on the board collects the data from the sensors, evaluates
them, and either sends them directly to a host computer or stores them in the
on-board RAM to make them available to the host at a later time. An EmoBoardSource
module running on the host computer converts the data into OmniRoute events and
sends them out to the framework.

Data Sinks 

The EmoLogger logging tool represents a universal data sink. It is a means to log and 
replay arbitrary data streams. For this purpose the associated EmoLoggerSink module 
can be configured to receive and store arbitrary data formats. Necessary or desirable
meta-information, such as unit, scale, timestamp, and the origin of the data, can be
obtained from the description of the incoming framework data type and from the
framework configuration. Data and meta-information are stored in a MySQL database.
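
A logging sink of this kind might look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for the MySQL database, and the table layout, the `LoggerSink` class and the JSON payload encoding are assumptions for illustration rather than the EmoLogger design.

```python
import json
import sqlite3


class LoggerSink:
    """Stores arbitrary events together with meta-information for later replay."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY, timestamp_ms INTEGER, origin TEXT, "
            "format TEXT, unit TEXT, payload TEXT)"
        )

    def consume(self, timestamp_ms, origin, data_format, unit, payload):
        # The payload is serialized as JSON so that any data format fits one column.
        self.db.execute(
            "INSERT INTO events (timestamp_ms, origin, format, unit, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (timestamp_ms, origin, data_format, unit, json.dumps(payload)),
        )
        self.db.commit()

    def replay(self, origin):
        """Yield (timestamp_ms, payload) for one origin in timestamp order."""
        cursor = self.db.execute(
            "SELECT timestamp_ms, payload FROM events "
            "WHERE origin = ? ORDER BY timestamp_ms, id",
            (origin,),
        )
        for timestamp_ms, payload in cursor:
            yield timestamp_ms, json.loads(payload)


sink = LoggerSink(sqlite3.connect(":memory:"))
sink.consume(20, "EmoBoard", "HeartRate", "bpm", {"value": 73})
sink.consume(10, "EmoBoard", "HeartRate", "bpm", {"value": 72})
sink.consume(15, "RealEYES", "Gaze", "px", {"x": 400, "y": 300})
replayed = list(sink.replay("EmoBoard"))
```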



^ FPGA - Field Programmable Gate Array; a programmable piece of hardware 

5 Current Work and Outlook

In order to integrate the above-mentioned components into the framework, initial
source and sink modules as well as the corresponding data formats have been
implemented. In this first stage, we are able to connect the sensor board and the
gaze tracking platform with the logging tool. Several filters have been implemented
that provide structural operations on the data, such as extraction of data
components and fusion of the data formats in use. The run-time configuration of the
data flow graph using XML configuration files and the event processing based on a
number of currently fixed data types have also been implemented. This already allows
us to easily set up and configure recording sessions for usability tests.

In the next step, the integration work will be continued by implementing further data
source modules (e.g. the facial expression detector and speech input and analysis)
and a sink module for an AI analysis tool. This will enable us to set up and
configure variable experiments for the analysis of either pre-recorded or on-line
sessions. A small repertoire of corresponding filter modules will be implemented as
they are needed.

In the near future, we will also add support for XML-configurable, generic data 
formats, which has not been realized yet. This will replace the currently used hard- 
coded data formats. 
Abstract. We address the issue of spontaneous gesture synthesis for embodied
conversation agents (ECAs), that is, the generation of appropriate gestures and
their coordination with spoken utterances. After a characterization of the
application constraints we establish the principal requirements of the gesture
generation framework. We demonstrate how these requirements can be met by
formulating gesture generation as a real-time search through gesture space (actually
gesture and facial expression) under the constraints arising from the graphical
model of the character and the linguistic properties of the utterance.



1 Gesture Synthesis in Embodied Conversation Agents 

The role of gestures in an embodied conversation agent (ECA) is much the same as in 
a real conversation, both complementing and supplementing speech to achieve robust 
and efficient communication [5]. There has been a significant amount of research in 
recent years into the requirements and technology underlying multimodal interfaces, in
which speech, gestures, and other modalities can be used in both the input and output
channels. We concern ourselves here with the generation problem alone. Whilst a
significant amount of effort has been devoted to multimodal generation, this has
mostly been concerned with the problem of coordinating language (whether written or
spoken) and static graphics or 2D non-character animations. The automatic generation
of coordinated gesture in a synthetic character has received relatively little attention.

For research where a key issue is the grounding of the semantics of the language 
used by a user and an interface character, some attention has been successfully paid to 
the problem of generating deictic references. That is, having characters generate 
pointing references to resolve or emphasize real-world (or more typically graphical) 
entities corresponding to linguistic references made in a dialog [2]. At the same time, 
and somewhat separately from the multimodal interfaces community, a significant 
research effort in the graphics community has continued into the problems of character 
animation, for both the face and body. However, most of this work has been driven by
the desire to produce what amount to productivity tools for animators, which, though a
worthy goal in itself, falls short of the requirements of ECAs, in which the actual
classes of bodily motions and facial expressions themselves must be identified and not
just modeled and animated. In relation to ECAs, or synthetic characters in general, the
industrial state of the art for generating spontaneous gesture is little more than the
playing of canned, biologically plausible animations with duration equal to or less than
the accompanying speech. One notable exception is the work of Cassell et al. [4] (see
also Yan [3]) on the Behavior Expression Animation Toolkit (BEAT). BEAT was one of the
first systematic attempts to integrate linguistics and graphics in the generation of
spontaneous gestures for an ECA. It uses a rule-based system to implement the initial
stage of the gesture generation pipeline, which marks up a spoken utterance with
respect to body and facial motion.



2 Application Context 

Our account of gesture generation is in the context of a commercially deployed,
web-enabled ECA architecture comprising Lexicle CS [1]. The real-world nature of this
solution places a number of constraints on the framework for gesture synthesis. The 
client application, by which the ECA is presented to the user, runs on standard PCs, 
over a narrowband network connection, and the server is scalable to large numbers of 
concurrent users. Low-end clients restrict the character modeling framework, pre- 
cluding techniques exclusively based on inverse kinematics. Conversely, a conse- 
quence of the data and processing requirements of the natural language understanding 
system is that it takes place on the server. The reliance of gesture generation on the 
linguistic analysis of user questions and system answers means that it most naturally 
resides as a post-processing stage of the computational linguistic pipeline on the 
server. 



3 Language Processing Framework 

Language processing in Lexicle CS is cast as a standard question-answering system,
implemented using a mixture of statistical low-level analysis techniques and a
hand-crafted parser and matching algorithm. Typed user queries, in the domain of
expertise of the ECA, are spell-corrected, tokenized, tagged (for parts of speech),
and parsed to a logical form capturing a shallow semantics of the question.

The character’s range of potential responses is similarly pre-processed, and a
match procedure between the semantic representations of the question and possible
responses is performed in the context of the preceding dialogue between the user and
the ECA. Gesture generation takes place on the server following the completion of
the natural language processing pipeline (i.e. after dialogue processing). Although
we do not present the full details of the natural language processing pipeline here,
this simple overview helps show that the input to the gesture planner (i.e. the
output of the matcher and its subsequent dialogue processing) is not just the set of
strings to be spoken: for each string there is an additional set of linguistic
annotations, as follows:

1. phonetic properties: system responses comprise both the answers over which 
semantic matching has occurred and template-based “filler” sentences generated 
according to dialogue progression. Thus the full set of spoken responses
corresponding to the answers, and the fully expanded set of sentences generated by
the templates, can be pre-generated, and the phoneme and token boundary timings
passed to the gesture planner.

2. syntactic properties: the tokenizer, tagger and parser yield the syntactic properties 
of the pre-generated sentences. Consequently a shallow syntactic analysis of the 
sentences is made available to the gesture planner, and functions for retrieving 
relevant features from this representation are implemented in the planner itself. 

3. dialogue properties: information pertaining to the dialogue that is conveyed to the
gesture planner comes in two basic forms. Firstly, there are a number of different
states pertaining to dialogue progression. Secondly, the relationship between the 
sentences of the responses is provided, for example, whether a sentence is a filler 
corresponding to a topic change, a question, a statement, a repeated statement, or a 
reproach to the user for inappropriate behavior. Both response and individual sen- 
tence properties are conveyed to the gesture planner. 

4. semantic properties: in addition to facilitating pre-generation of the synthesized
speech (and previewing and modification of default pronunciations), fixed responses
also allow the author of the content to annotate the responses semantically by
sense-marking significant words deemed important by the matching algorithm.
Consequently, in addition to the surface forms of the spoken responses,
disambiguated senses, based on WordNet [6], for verbs, nouns, adjectives and adverbs
are available to the gesture planner.
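
The four kinds of annotation listed above could be carried by a record along the following lines. This is only a sketch of one possible representation: the class and field names, the tag set and the sense identifiers are illustrative assumptions, not the data structures used in Lexicle CS.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    text: str
    pos: str            # part-of-speech tag (syntactic property)
    start_ms: int       # token boundary timings (phonetic property)
    end_ms: int
    sense: str = ""     # sense identifier for sense-marked words, else empty


@dataclass
class AnnotatedSentence:
    tokens: List[Token]
    sentence_type: str  # dialogue property, e.g. "statement" or "topic_change"
    # Phoneme timings as (phoneme, start_ms, end_ms); left empty in this example.
    phonemes: List[Tuple[str, int, int]] = field(default_factory=list)

    def duration_ms(self) -> int:
        return self.tokens[-1].end_ms - self.tokens[0].start_ms

    def sense_marked(self) -> List[Token]:
        return [t for t in self.tokens if t.sense]


sentence = AnnotatedSentence(
    tokens=[
        Token("the", "DT", 0, 120),
        Token("tower", "NN", 120, 520, sense="tower.n.01"),
        Token("is", "VBZ", 520, 640),
        Token("tall", "JJ", 640, 1000, sense="tall.a.01"),
    ],
    sentence_type="statement",
)
```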



4 Language and Gesture Coordination 

Whilst the linguistic analysis at each level may be available, current theories relating 
how these factors interact with each other are at best nascent [5] and lack extensive 
empirical support. Thus any subsequent gesture generation algorithm inevitably re- 
quires us to prioritize one linguistic factor (and the associated class of gestures) over 
another. Whilst this might appear ad hoc, the very formulation of the problem, that is, 
the post-processing of spoken language in order to apply a layer of “coordinated” 
gesture, is no less artificial. Our approach to gesture generation is to postulate appro- 
priate gestures at each level of linguistic analysis and impose a fixed prioritization 
between the gesture classes. We formulate our language-gesture coordination problem 
by considering each level of analysis to provide support for classes of gesture or mo- 
tion as follows: 

1. phonetic: a phonetic analysis of an utterance is generated as a byproduct of speech
synthesis and drives lip synchronization. Though not directly relevant to the gesture
generation problem, we include it here for completeness.

2. syntactic: syntactic properties of an utterance have a particular bearing on the
generation of beat gestures. Significant linguistic features include the linguistic
head of the sentence, the subject and object, and the syntactic categories of the
tokens and their composition. We consider syntactic properties to be one of the
primary drivers for beat gestures, but for no other class of gesture.

3. lexical: emblematic gestures can be thought of as directly translatable into natural
language equivalents, such as the shaken head for “no” or the raised hand with
palm facing forward for “goodbye”. The synthesis of emblems can be treated as one
of direct correspondence between gestures and particular lexical items.

4. dialog: character posture, and gestures associated with asking questions, are the 
only aspects of the ECA that originate from the dialog function of the character’s 
utterances. For example, whether the function of the sentence is to set the topic, re- 
iterate that the answer has been given before, provide a factual answer, or pose a 
question, can all have a bearing on the posture and base motions of the character. 

5. semantics: iconic and metaphoric gestures have their origin in the meaning of the
character’s utterances, or more precisely, in the conceptual model of the character
that precedes the actual spoken utterance itself. Since our natural language
processing framework requires the ECA’s answers (though not the “fillers”) to be
sense-marked, and the senses themselves are selected from WordNet, icon and metaphor
generation can be based on the selection of gestures assigned to a synset or by
inheritance from a gesture assigned to a synset or its hyponym.
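
One possible reading of the synset-based selection in item 5 is sketched below: a sense with its own gesture uses it directly, and otherwise the gesture of the nearest related synset within a small distance is inherited. The direction of the walk (towards more general synsets), the distance limit, the hierarchy fragment and the gesture names are all assumptions for illustration; a real system would query WordNet itself.

```python
# Hypothetical hand-written fragment of a hypernym hierarchy (child -> parent).
HYPERNYM = {
    "skyscraper.n.01": "building.n.01",
    "church.n.01": "building.n.01",
    "building.n.01": "structure.n.01",
    "structure.n.01": "artifact.n.01",
    "pebble.n.01": "rock.n.01",
}

# Hypothetical gesture assignments made by a content author.
GESTURE_FOR_SYNSET = {
    "building.n.01": "hands_outline_box",
    "skyscraper.n.01": "hand_raised_high",
}


def gesture_for_sense(sense, max_distance=2):
    """Return (gesture, distance), following at most max_distance hierarchy links."""
    current, distance = sense, 0
    while current is not None and distance <= max_distance:
        if current in GESTURE_FOR_SYNSET:
            return GESTURE_FOR_SYNSET[current], distance
        current = HYPERNYM.get(current)
        distance += 1
    return None, None


direct = gesture_for_sense("skyscraper.n.01")
inherited = gesture_for_sense("church.n.01")
missing = gesture_for_sense("pebble.n.01")
```

Returning the distance alongside the gesture lets a planner prefer directly assigned gestures over inherited ones when several sense-marked tokens compete.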

Gesture synthesis is cast as an under-constrained search problem. On the one hand, 
there are the linguistic constraints identified above: the base motions corresponding to 
the dialog functions of the sentences comprising the answers, the tokens for which the
sense marking yields iconic and metaphoric gestures within their semantic proximity in
WordNet, the assignment of lexicalized gestures for a character, and the syntactic
properties of the sentence (e.g. the identification of which nouns and verbs are more 
likely to have associated beat gestures). 

Further constraints apply to the motions themselves, for example, which base motions
may be blended with each other, the state chart of character postures, base and
overlay motion compatibilities, and constraints on the temporal and spatial scaling that
may be applied to each motion. The final constraint is the timing properties of the
synthesized speech itself. The search operators themselves are the selection, scaling,
blending and overlaying of morph targets and motions, and there is a subsequent
probabilistic filtering to control the frequency of gestures made by the ECA. Whilst
this search-based formulation of the problem may be under-constrained, the order in
which motions and morph targets are selected still has a significant impact on the
final solution.
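
The effect of selection order can be illustrated with a deliberately simplified sketch: candidates are chosen greedily by gesture class priority, subject to speech timing and non-overlap, and then thinned by a probabilistic filter. The priority order, the candidate tuples and the function name are assumptions for illustration; the text does not state the actual order, and blending, scaling and posture constraints are omitted.

```python
import random

# Assumed priority order (highest first); not taken from the paper.
PRIORITY = ["emblem", "iconic", "beat"]


def plan_gestures(candidates, speech_end_ms, keep_probability, rng):
    """Greedy selection by class priority, then probabilistic thinning.

    Each candidate is a tuple (gesture_class, name, start_ms, end_ms).
    """
    chosen = []
    ordered = sorted(candidates, key=lambda c: (PRIORITY.index(c[0]), c[2]))
    for cand in ordered:
        _, _, start, end = cand
        if end > speech_end_ms:
            continue  # the gesture must finish with the speech
        if any(start < e and s < end for _, _, s, e in chosen):
            continue  # overlaps a gesture that was selected earlier
        chosen.append(cand)
    # Probabilistic filter controlling how often the character gestures.
    kept = [c for c in chosen if rng.random() < keep_probability]
    return sorted(kept, key=lambda c: c[2])


candidates = [
    ("beat", "beat_on_noun", 100, 400),
    ("iconic", "hands_outline_box", 300, 900),
    ("beat", "beat_on_verb", 950, 1200),
    ("emblem", "wave_goodbye", 1500, 2300),
]
plan = plan_gestures(candidates, speech_end_ms=2000, keep_probability=1.0,
                     rng=random.Random(0))
```

With this ordering the iconic gesture displaces the overlapping beat on the noun; processing beats first would have kept the beat and dropped the iconic gesture instead, which is the order sensitivity noted above.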